Bug 1665995
| Summary: | Huge latency spike for VM(s) with 1 rt vCPU | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Pei Zhang <pezhang> |
| Component: | kernel-rt | Assignee: | Juri Lelli <jlelli> |
| kernel-rt sub component: | Other | QA Contact: | Pei Zhang <pezhang> |
| Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
| Severity: | high | ||
| Priority: | unspecified | CC: | bhu, chayang, hhuang, juzhang, lcapitulino, lgoncalv, mtosatti, pezhang, virt-maint |
| Version: | 8.1 | Keywords: | Regression |
| Target Milestone: | rc | Flags: | bhu: mirror+ |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-03 15:56:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1666562, 1666680, 1669294 | ||
| Bug Blocks: | 1640832, 1649535, 1701002 | ||
Description
Pei Zhang
2019-01-14 15:58:51 UTC
I checked the difference between rhel7 and rhel8 hosts today: In rhel7.7 host: # cat /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns 2000 # cat /sys/module/kvm/parameters/lapic_timer_advance_ns 2000 In rhel8.0 host: # cat /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns cat: /usr/lib/tuned/realtime-virtual-host/lapic_timer_adv_ns: No such file or directory # cat /sys/module/kvm/parameters/lapic_timer_advance_ns 0 Luiz, could you please have a look if this is a possible reason of spike? Thanks. Best regards, Pei Pei, Yes, I'll look into this later today or tomorrow. I'll have to use your machine (but I already have the login details). I took a very quick look and it seems we have two problems: 1. Sometime after booting, the kernel decides that the TSC on CPU19 is unstable and changes the clocksource to hpet. I didn't know the kernel could mark the TSC of a single CPU as unstable, but in any case KVM-RT requires a stable TSC (and you're using CPU19 on your KVM-RT guest) [ 2611.752700] clocksource: timekeeping watchdog on CPU19: Marking clocksource 'tsc' as unstable because the skew is too large: [ 2611.765227] clocksource: 'hpet' wd_now: b5059197 wd_last: bbcfe69d mask: ffffffff [ 2611.775703] clocksource: 'tsc' cs_now: 5d733215270 cs_last: 218c2d31186 mask: ffffffffffffffff [ 2611.787464] tsc: Marking TSC unstable due to clocksource watchdog [ 2611.787472] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'. [ 2611.787475] sched_clock: Marking unstable (2611880509082, -95077647)<-(2611847545715, -60076864) [ 2611.822971] clocksource: Switched to clocksource hpet 2. tuned is failing when applying the realtime-virtual-host profile: 2019-01-14 09:01:12,100 ERROR tuned.plugins.plugin_script: script '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to start rt-entsk.service: Unit rt-entsk.service not found.' 2019-01-14 09:01:12,156 ERROR tuned.units.manager: BUG: Unhandled exception in start_tuning: dictionary changed size during iteration Although the TSC issue seems quite serious and impactful for KVM-RT, I can't tell with 100% certainty that one or both issues are the root cause of the spike. We'll have to fix them to know it. I'll open a different BZ for issue 2 and will follow up with instructions for issue 1. Pei, For issue 1, I think you said you were using this machine with a RHEL7 kernel and you didn't have any issues, right? Do you remember if you were using CPU19 (ie. if a qemu thread was pinned to that CPU). Maybe Marcelo will have better recommendations, but I think I'd first try to determine if we need a guest to cause this issue. One way of doing this is to just reboot the system and wait 1 hour or 2. Then check if the error is in dmesg. If it is there, then we don't need a guest and you could try the same procedure with the non-RT kernel to see if this problem is specific to RT or not. (In reply to Luiz Capitulino from comment #3) > > 2. tuned is failing when applying the realtime-virtual-host profile: > > 2019-01-14 09:01:12,100 ERROR tuned.plugins.plugin_script: script > '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to > start rt-entsk.service: Unit rt-entsk.service not found.' > 2019-01-14 09:01:12,156 ERROR tuned.units.manager: BUG: Unhandled > exception in start_tuning: dictionary changed size during iteration > > Although the TSC issue seems quite serious and impactful for KVM-RT, I can't > tell with 100% certainty that one or both issues are the root cause of the > spike. 
We'll have to fix them to know it. > > I'll open a different BZ for issue 2 and will follow up with instructions > for issue 1. Hi Luiz, I've filed bug 1666562 to track it. (In reply to Luiz Capitulino from comment #4) > Pei, > > For issue 1, I think you said you were using this machine with a > RHEL7 kernel and you didn't have any issues, right? Do you remember > if you were using CPU19 (ie. if a qemu thread was pinned to that CPU). Yes, the RHEL7.7 works well. Yes, I was using CPU19 as guest housekeeping cpu in scenario "Single VM with 1 rt vCPU". We test RHEL7.7 and RHEL8 with same configurations, including cores isolation, cpu pin. I checked history testing of RHEL8. (As the spike looks very big, so I ran several runs to double confirm before reporting bug). ==For scenario "Single VM with 1 rt vCPU"== Server: dell-per430-09.lab.eng.pek2.redhat.com TSC related log: "clocksource: timekeeping watchdog on CPU19..." Running: 3 times, all 3 runs hit spike and showed above TSC log. CPU pin info: <cputune> <vcpupin vcpu='0' cpuset='19'/> <vcpupin vcpu='1' cpuset='18'/> <emulatorpin cpuset='1,3,5,7,9'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> </cputune> ==For scenario "Multiple VMs each with 1 rt vCPU"== Server: dell-per430-11.lab.eng.pek2.redhat.com TSC related log: "clocksource: timekeeping watchdog on CPU13:.." "clocksource: timekeeping watchdog on CPU15:.." "clocksource: timekeeping watchdog on CPU19:.." Running: 4 times, all 4 runs hit spike and showed one of above TSC logs. CPU pin info: -VM1 <cputune> <vcpupin vcpu='0' cpuset='19'/> <vcpupin vcpu='1' cpuset='18'/> <emulatorpin cpuset='1,3,5,7,9'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> </cputune> -VM2 <cputune> <vcpupin vcpu='0' cpuset='17'/> <vcpupin vcpu='1' cpuset='16'/> <emulatorpin cpuset='1,3,5,7,9'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> </cputune> -VM3 <cputune> <vcpupin vcpu='0' cpuset='15'/> <vcpupin vcpu='1' cpuset='14'/> <emulatorpin cpuset='1,3,5,7,9'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> </cputune> -VM4 <cputune> <vcpupin vcpu='0' cpuset='13'/> <vcpupin vcpu='1' cpuset='12'/> <emulatorpin cpuset='1,3,5,7,9'/> <vcpusched vcpus='0' scheduler='fifo' priority='1'/> <vcpusched vcpus='1' scheduler='fifo' priority='1'/> </cputune> For both scenarios: Host cores isolation: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per430--11-swap rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap console=ttyS0,115200n81 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 default_hugepagesz=1G iommu=pt intel_iommu=on Guest cores isolation: BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-60.rt9.112.el8.x86_64 root=/dev/mapper/rhel_bootp--73--75--106-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_bootp--73--75--106-swap rd.lvm.lv=rhel_bootp-73-75-106/root rd.lvm.lv=rhel_bootp-73-75-106/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 default_hugepagesz=1G iommu=pt intel_iommu=on > > Maybe Marcelo will have better recommendations, but I 
think I'd first > try to determine if we need a guest to cause this issue. One way of > doing this is to just reboot the system and wait 1 hour or 2. Then > check if the error is in dmesg. If it is there, then we don't need a > guest and you could try the same procedure with the non-RT kernel to > see if this problem is specific to RT or not. I'll update the results after rebooting host and waiting 2 hours later. Thank you. Best regards, Pei (In reply to Pei Zhang from comment #5) [...] > > Maybe Marcelo will have better recommendations, but I think I'd first > > try to determine if we need a guest to cause this issue. One way of > > doing this is to just reboot the system and wait 1 hour or 2. Then > > check if the error is in dmesg. If it is there, then we don't need a > > guest and you could try the same procedure with the non-RT kernel to > > see if this problem is specific to RT or not. > > I'll update the results after rebooting host and waiting 2 hours later. Waiting 2 hours after rebooting rt host, there is no TSC related log in rt host. Best regards, Pei Thanks for this info. There are two other things you could do:
- Since this appears to be a regression, you could try a few older kernels
to see if the issue goes away. If it does, then that's excellent news.
- Try the following procedure to see if it reproduces without KVM:
1. Reboot the system
2. Write a hog app, say hog.c:
int main(void) { for (;;); return 0; }
3. Build it and run it with fifo:1 pinned to CPU19 (see the combined sketch after this list):
# chrt -f 1 taskset -c 19 hog
4. Wait a few hours to see if it triggers the issue
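For reference, here is a self-contained version of steps 2-4 above; the gcc invocation, the backgrounding, and the dmesg follow are illustrative assumptions rather than part of the original instructions:

```sh
# Write the hog program (spins at 100% CPU in user space)
cat > hog.c << 'EOF'
int main(void) { for (;;); return 0; }
EOF

# Build it
gcc -O2 -o hog hog.c

# Run it at SCHED_FIFO priority 1, pinned to the isolated CPU (CPU19 here)
chrt -f 1 taskset -c 19 ./hog &

# Watch for the clocksource watchdog marking the TSC unstable
dmesg -w | grep -iE 'tsc|clocksource'
```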
If none of the above reveals any new information, then KVM is probably
involved in this issue and we'll have to debug it. I'll try to reproduce it too.
(In reply to Luiz Capitulino from comment #7) > Thanks for this info. There are two other things you could do: > > - Since this appears to be a regression, you could go back some kernels > to see if the issue goes away. If it does, then that's excellent news > > - Try the following procedure to see if it reproduces without KVM: > > 1. Reboot the system > > 2. Write a hog app, say hog.c: > > int main(void) { for (;;); return 0 }; > > 3. Build it and dun it with fifo:1 pinned to CPU19: > > # chrt -f 1 taskset -c 19 hog > > 4. Wait a few hours to see if it will trig the issue > > If none of the above reveals any new information, then KVM is probably > involved in this issue and we'll have to debug it. I'll try to reproduce it > too. OK. I'm doing these 2 things. Will update the results once finish. Best regards, Pei There are two things I forgot to tell you. The first one is that our first step for this BZ should be to make it reproducible. That is, boot the machine, run the single guest test-case and monitor dmesg to see how long we need until the kernel marks the TSC as unstable. Then we do what I described in comment 7, since we'll know how long it takes to reproduce. The other thing is, I think you could try with the non-RT kernel too as a last resort (ie. after trying the 2 things from comment 7). The latencies may not be good with the non-RT kernel, but we just want to know if the TSC will become unstable too. Pei, I think I may have found the root cause of this issue. Please, let's cancel the debugging from comment 7 and comment 9. I'm working on a kernel that I plan to post here soon for you to test. (In reply to Luiz Capitulino from comment #10) > Pei, > > I think I may have found the root cause of this issue. Please, let's > cancel the debugging from comment 7 and comment 9. I'm working on > a kernel that I plan to post here soon for you to test. Got it. Best regards, Pei Hello! (In reply to Luiz Capitulino from comment #3) > I took a very quick look and it seems we have two problems: > > 1. Sometime after booting, the kernel decides that the TSC on CPU19 is > unstable > and changes the clocksource to hpet. I didn't know the kernel could mark > the TSC of a single CPU as unstable, but in any case KVM-RT requires a > stable TSC (and you're using CPU19 on your KVM-RT guest) > > [ 2611.752700] clocksource: timekeeping watchdog on CPU19: Marking > clocksource 'tsc' as unstable because the skew is too large: > [ 2611.765227] clocksource: 'hpet' wd_now: b5059197 > wd_last: bbcfe69d mask: ffffffff > [ 2611.775703] clocksource: 'tsc' cs_now: 5d733215270 > cs_last: 218c2d31186 mask: ffffffffffffffff > [ 2611.787464] tsc: Marking TSC unstable due to clocksource watchdog > [ 2611.787472] TSC found unstable after boot, most likely due to broken > BIOS. Use 'tsc=unstable'. > [ 2611.787475] sched_clock: Marking unstable (2611880509082, > -95077647)<-(2611847545715, -60076864) > [ 2611.822971] clocksource: Switched to clocksource hpet Yes this can slowdown (although not that significantly) So the And BTW, such machines (with unstable TSC) should not be supported by RHEL. Separate bug, though. > > 2. tuned is failing when applying the realtime-virtual-host profile: > > 2019-01-14 09:01:12,100 ERROR tuned.plugins.plugin_script: script > '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to > start rt-entsk.service: Unit rt-entsk.service not found.' Make sure rt-entsk binary is running to workaround the chronyd initiated problem. 
> 2019-01-14 09:01:12,156 ERROR tuned.units.manager: BUG: Unhandled > exception in start_tuning: dictionary changed size during iteration > > Although the TSC issue seems quite serious and impactful for KVM-RT, I can't > tell with 100% certainty that one or both issues are the root cause of the > spike. We'll have to fix them to know it. > > I'll open a different BZ for issue 2 and will follow up with instructions > for issue 1. (In reply to Marcelo Tosatti from comment #12) > And BTW, such machines (with unstable TSC) should not be supported by > RHEL. It is actually a false positive, I'll follow up with an explanation. > > 2. tuned is failing when applying the realtime-virtual-host profile: > > > > 2019-01-14 09:01:12,100 ERROR tuned.plugins.plugin_script: script > > '/usr/lib/tuned/realtime-virtual-host/script.sh' error output: 'Failed to > > start rt-entsk.service: Unit rt-entsk.service not found.' > > Make sure rt-entsk binary is running to workaround the chronyd initiated > problem. As reported in bug 1666562, this binary doesn't seem to exist in RHEL8. I was able to reproduce this issue and I believe I have debugged it down to the root cause. What follows is my hypothesis. However, I'm still running a long duration test to confirm my findinds and would like Pei to do the same before we commit to moving forward. The kernel clocksource code has a feature called the clocksource watchdog. The goal of the clocksource watchdog is to detect unstable TSCs at run-time. It does this by scheduling a timer on each CPU every few seconds. The timer handler will read the time from different sources and do some calculations to find out if there's a skew in the TSC. It's a complex code that take locks, disable interrupts, etc. In the host, since our RT threads run at 100%, the clocksource watchdog seems to hang between clock reads, which causes it to think there's a skew, which in turn cause it to mark the TSC as unstable. In the guest, it actually succeeds running, however it causes nohz full to de-activate. In both cases it also generates a lot of interrupts. Everything contributes to the big spike we're seeing. Now, why are we only seeing this in RHEL8? Because in RHEL7, the clocksource watchdog was optional: you can disable it and we did disable it for RHEL7. However, on recent upstream and RHEL8, the clocksource watchdog is hardcoded enabled in x86_64. It seems to be a platform requirement to have it. To test my hypothesis, I forcefully disabled the clocksource watchdog in the code. Which fixed the TSC unstable thing and the spike. PS: I use this machine since the beggining of time for KVM-RT and never had TSC problems with it (just like Pei) PPS: Note that we also have tuned issues. I had to downgrade my tuned, which caused a bunch of issues on its own but I don't have spikes anymore 12 hours testing looks good too, didn't hit huge spike. 
(1)Single VM with 1 rt vCPU: # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00031 (2)Multiple VMs each with 1 rt vCPU: -VM1 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00032 -VM2 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00031 -VM3 # Min Latencies: 00007 # Avg Latencies: 00015 # Max Latencies: 00037 -VM4 # Min Latencies: 00007 # Avg Latencies: 00015 # Max Latencies: 00032 Versions: qemu-kvm-3.1.0-5.module+el8+2708+fbd828c6.x86_64 tuned-2.10.0-14.el8.noarch libvirt-4.10.0-1.module+el8+2317+367e35b5.x86_64 kernel-rt-4.18.0-60.rt9.112.el8.test.x86_64 Beaker job: https://beaker.engineering.redhat.com/recipes/6422241#tasks Excellent! So, this means the root cause from comment 14 is correct. Pei, if you have the bandwidth to now run the complete suite for 24 hours, you could do it. This is very useful, because we'll test most components except for tuned (which is older). (In reply to Luiz Capitulino from comment #19) > Excellent! So, this means the root cause from comment 14 is correct. > > Pei, if you have the bandwidth to now run the complete suite for > 24 hours, you could do it. This is very useful, because we'll > test most components except for tuned (which is older). Hi Luiz, OK. I'm testing full suit for 24 hours now. Will update the results once finish. About tuned, I think we can test with latest version. As Bug 1666562 mentioned, need to install rt-setup manually. I tried and rt-entsk.service ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried with latest tuned version and it looks good. Please let me know if you have any comment. Thanks. Best regards, Pei (In reply to Pei Zhang from comment #20) > (In reply to Luiz Capitulino from comment #19) > > Excellent! So, this means the root cause from comment 14 is correct. > > > > Pei, if you have the bandwidth to now run the complete suite for > > 24 hours, you could do it. This is very useful, because we'll > > test most components except for tuned (which is older). > > Hi Luiz, > > OK. I'm testing full suit for 24 hours now. Will update the results once > finish. > > About tuned, I think we can test with latest version. As Bug 1666562 > mentioned, need to install rt-setup manually. I tried and rt-entsk.service > ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried > with latest tuned version and it looks good. Yes, you're right about bug 1666562. But note that we also have bug 1669294, which doesn't seem to happen 100% of the time. So, before starting the test you can check /var/log/tuned.log and see if you have this error message: 2019-01-16 03:22:34,859 ERROR tuned.units.manager: BUG: Unhandled exception in start_tuning: dictionary changed size during iteration If you do, reboot & try again. Note that the error message has to contain "start_tuning" to be harmful, if it contains "stop_tuning" then it's fine to run the test since it means the bug triggered when rebooting. (In reply to Luiz Capitulino from comment #21) > (In reply to Pei Zhang from comment #20) > > (In reply to Luiz Capitulino from comment #19) > > > Excellent! So, this means the root cause from comment 14 is correct. > > > > > > Pei, if you have the bandwidth to now run the complete suite for > > > 24 hours, you could do it. This is very useful, because we'll > > > test most components except for tuned (which is older). > > > > Hi Luiz, > > > > OK. I'm testing full suit for 24 hours now. Will update the results once > > finish. 
> > > > About tuned, I think we can test with latest version. As Bug 1666562 > > mentioned, need to install rt-setup manually. I tried and rt-entsk.service > > ERROR disappears. Also, In above 12 hours tests in Comment 18, I've tried > > with latest tuned version and it looks good. > > Yes, you're right about bug 1666562. But note that we also have bug 1669294, > which doesn't seem to happen 100% of the time. So, before starting the > test you can check /var/log/tuned.log and see if you have this error > message: > > 2019-01-16 03:22:34,859 ERROR tuned.units.manager: BUG: Unhandled > exception in start_tuning: dictionary changed size during iteration > If you do, reboot & try again. Note that the error message has to > contain "start_tuning" to be harmful, if it contains "stop_tuning" > then it's fine to run the test since it means the bug triggered > when rebooting. Make sense. I checked and this "start_tuning" error still exists in latest tuned version. So I agree and I'll change back to use old tuned-2.10.0-7.el8.noarch version. Thanks you Luiz for reminding. Best regards, Pei Pei, You can try rebooting to see if it goes away. On my system, it takes two or three reboots for it to trigger. Testing kernel with Luiz's patch, full suit cases tests of 24 hours get pass. Max latency looks good. (1)Single VM with 1 rt vCPU: # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00033 (2)Multiple VMs each with 1 rt vCPU: -VM1 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00037 -VM2 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00031 -VM3 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00035 -VM4 # Min Latencies: 00007 # Avg Latencies: 00014 # Max Latencies: 00032 (3)Single VM with 8 rt vCPUs: # Min Latencies: 00013 00007 00012 00012 00012 00014 00013 00013 # Avg Latencies: 00017 00014 00017 00017 00017 00017 00017 00017 # Max Latencies: 00036 00026 00029 00028 00029 00029 00028 00034 Versions: kernel-rt-4.18.0-60.rt9.112.el8.test.x86_64 libvirt-4.10.0-1.module+el8+2317+367e35b5.x86_64 tuned-2.10.0-7.el8.noarch qemu-kvm-3.1.0-7.module+el8+2715+f4b84bed.x86_64 Beaker job: https://beaker.engineering.redhat.com/recipes/6431467#tasks (In reply to Luiz Capitulino from comment #23) > Pei, > > You can try rebooting to see if it goes away. On my system, it takes two or > three > reboots for it to trigger. OK, I'll have a try manually. Thanks. Best regards, Pei Here's a description of the test-case that reproduces the bug. In the host, we isolate one or more CPUs (that is, we move off all user-space tasks, all kernel threads that can be moved, set irq masks, workqueue cpumasks, etc). Then, we start a guest with two CPUs, each of them pinned to an isolated CPU in the host. In the guest, one of the CPUs is isolated just like in the host, the other is not so that it can run housekeeping tasks. In the host, the guest CPU threads are assigned RT priority. In the guest, we start a hog thread, and pin it to the isolated CPU. We also run cyclictest on that CPU, with the intent of measuring all the interruptions the hog thread would incur. The issue described in this bug happens towards the end of the test-run, for tests with at least 30 minutes of duration. (In reply to Luiz Capitulino from comment #26) > Here's a description of the test-case that reproduces the bug. > > In the host, we isolate one or more CPUs (that is, we move off all > user-space tasks, all kernel threads that can be moved, set irq > masks, workqueue cpumasks, etc). 
Then, we start a guest with two > CPUs, each of them pinned to an isolated CPU in the host. In the guest, > one of the CPUs is isolated just like in the host, the other is not > so that it can run housekeeping tasks. In the host, the guest CPU > threads are assigned RT priority. > > In the guest, we start a hog thread, and pin it to the isolated CPU. > We also run cyclictest on that CPU, with the intent of measuring > all the interruptions the hog thread would incur. > > The issue described in this bug happens towards the end of the test-run, > for tests with at least 30 minutes of duration. Hi Luiz, I believe I've been able to tune a system to achieve acceptable latencies (both host and guest). So, I'd like to reproduce the problem. Re-reading the comments, however, it's not clear to me whether the hog thread is supposed to be running in the host or in the guest and if it's a FIFO or NORMAL task. If it's FIFO, what is the priority it is running at? What about cyclictest and kwatchdogd(s)? Could you please clarify? Thanks! I've been able to reproduce both symptoms (latency spike and tsc unstable warning) running the scripts Luiz shared on a system configured following kvm-rt tuning guide. However, I'm not able to reproduce the tsc unstable warning anymore after upgrading to latest version of python-linux-procfs (https://bugzilla.redhat.com/show_bug.cgi?id=1669294). My understanding is that, before the above package got fixed, tuned was failing to apply required tuning at boot and this was causing, for example, ktimersoftd to run at the same priority of the isolated guest vcpu. It couldn't thus preempt the vcpu when required (for running clocksource watchdog calculations) and was ending up seeing huge clock skew when finally got to run (thus disabling tsc and printing the warning). That said, I'm still seeing differences between having clocksource watchdog enabled or not, but not so big. In tests I performed (a couple of hours runtime) having watchdog enabled showed ~50us max latency against ~30us when disabling it. I also performed a test disabling the watchdog only inside the guest and this one showed ~30us max latency as well. So, I was wondering if it could make sense to run the tests again on the reference test box(es) just to check where we stand after packages affecting tuned behavior got fixed. In case we still see latencies that are not acceptable it would maybe interesting to disable clocksource watchdog inside the guest only. I build a kernel where an additional cmdline can enable (by default) or disable the watchdog. "nocswd" disables it at boot. It might be useful to quickly test different configurations: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20274388 Testing with latest kernel-rt can work as well of course to see where we stand. Thanks! (In reply to Juri Lelli from comment #29) > I've been able to reproduce both symptoms (latency spike and tsc > unstable warning) running the scripts Luiz shared on a system > configured following kvm-rt tuning guide. > > However, I'm not able to reproduce the tsc unstable warning anymore > after upgrading to latest version of python-linux-procfs > (https://bugzilla.redhat.com/show_bug.cgi?id=1669294). > My understanding is that, before the above package got fixed, > tuned was failing to apply required tuning at boot and this > was causing, for example, ktimersoftd to run at the same > priority of the isolated guest vcpu. 
It couldn't thus preempt > the vcpu when required (for running clocksource watchdog > calculations) and was ending up seeing huge clock skew when > finally got to run (thus disabling tsc and printing the > warning). This makes a lot of sense! Thanks a lot for checking this out. Now with bug 1669294 fixed, we went from total catastrophe to having a spike. > That said, I'm still seeing differences between having > clocksource watchdog enabled or not, but not so big. In tests > I performed (a couple of hours runtime) having watchdog > enabled showed ~50us max latency against ~30us when disabling > it. I also performed a test disabling the watchdog only > inside the guest and this one showed ~30us max latency as > well. > > So, I was wondering if it could make sense to run the > tests again on the reference test box(es) just to check > where we stand after packages affecting tuned behavior > got fixed. In case we still see latencies that are not > acceptable it would maybe interesting to disable clocksource > watchdog inside the guest only. > > I build a kernel where an additional cmdline can enable > (by default) or disable the watchdog. "nocswd" disables > it at boot. It might be useful to quickly test different > configurations: > https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=20274388 I think it makes sense to do this testing. However, going from 30us to 50us is still a big spike. So, I still think we have to disable the clocksource watchdog. Pei, can you run this testing? Basically, you have to install the fixed version of python-linux-procfs and run all three test-cases against latest RHEL8 RT kernel and Juri's kernel. I think running for 1 hour is enough. Btw, Juri, I don't think your comment has to be private. So, if you make your comment public, you can make mine too. > > Testing with latest kernel-rt can work as well of course > to see where we stand. > > Thanks! (In reply to Luiz Capitulino from comment #30) [...] > Pei, can you run this testing? Basically, you have to install > the fixed version of python-linux-procfs and run all three > test-cases against latest RHEL8 RT kernel and Juri's kernel. > I think running for 1 hour is enough. Maybe also double check that ktimersoftd(s) have FIFO:3 priority, ksoftirqd(s) have FIFO:2 priority and that timer_migration is disabled (/proc/sys/kernel/timer_migration == 0) before starting the tests (if possible). Both host and guest. 
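For convenience, the three checks requested above can be run in one go; this is a minimal sketch that reuses the same commands shared later in this thread:

```sh
# ksoftirqd threads are expected at SCHED_FIFO priority 2,
# ktimersoftd threads at SCHED_FIFO priority 3
for PID in $(pgrep ksoftirqd) $(pgrep ktimersoftd); do chrt -p ${PID}; done

# Timer migration should be disabled (0) on both host and guest
cat /proc/sys/kernel/timer_migration
```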
Hi Luiz, Juri, Currently I just finished one test case testing based on above testing requirements, there is no huge latency any more, however the max latency still exceeds 40us for both 2 versions: (1) With 4.18.0-68.rt9.123.el8.x86_64 Test scenario: Single VM with single rt vCPU 1 hour max latency is: # Min Latencies: 00007 # Avg Latencies: 00016 # Max Latencies: 00045 (2) With 4.18.0-68.rt9.123.el8.nocswd.x86_64 (build provided by Juri) Test scenario: Single VM with single rt vCPU Host kernel line (adding nocswd): # cat /proc/cmdline BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto resume=/dev/mapper/rhel_dell--per430--11-swap rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap console=ttyS0,115200n81 skew_tick=1 isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 nocswd default_hugepagesz=1G iommu=pt intel_iommu=on Guest kernel line(adding nocswd): BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 root=/dev/mapper/rhel_vm--74--52-root ro console=tty0 console=ttyS0,115200n8 biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--52-swap rd.lvm.lv=rhel_vm-74-52/root rd.lvm.lv=rhel_vm-74-52/swap skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 nocswd default_hugepagesz=1G iommu=pt intel_iommu=on 1 hour max latency is: # Min Latencies: 00008 # Avg Latencies: 00015 # Max Latencies: 00049 Other infos for both 2 versions: 1. Versions besides kernel-rt: tuned-2.10.0-14.el8.noarch python3-linux-procfs-0.6-6.el8.noarch 2. For both host and guest: # cat /proc/sys/kernel/timer_migration 0 Juri, could you please share how to check the FIFO value of ktimersoftd(s) and ksoftirqd(s)? Thanks. 
Best regards, Pei (In reply to Pei Zhang from comment #32) > Hi Luiz, Juri, > > Currently I just finished one test case testing based on above testing > requirements, there is no huge latency any more, however the max latency > still exceeds 40us for both 2 versions: > > (1) With 4.18.0-68.rt9.123.el8.x86_64 > > Test scenario: Single VM with single rt vCPU > > 1 hour max latency is: > > # Min Latencies: 00007 > # Avg Latencies: 00016 > # Max Latencies: 00045 > > > (2) With 4.18.0-68.rt9.123.el8.nocswd.x86_64 (build provided by Juri) > > Test scenario: Single VM with single rt vCPU > > Host kernel line (adding nocswd): > > # cat /proc/cmdline > BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 > root=/dev/mapper/rhel_dell--per430--11-root ro crashkernel=auto > resume=/dev/mapper/rhel_dell--per430--11-swap > rd.lvm.lv=rhel_dell-per430-11/root rd.lvm.lv=rhel_dell-per430-11/swap > console=ttyS0,115200n81 skew_tick=1 > isolcpus=2,4,6,8,10,12,14,16,18,19,17,15,13 intel_pstate=disable > nosoftlockup nohz=on nohz_full=2,4,6,8,10,12,14,16,18,19,17,15,13 > rcu_nocbs=2,4,6,8,10,12,14,16,18,19,17,15,13 nocswd default_hugepagesz=1G > iommu=pt intel_iommu=on > > Guest kernel line(adding nocswd): > > BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-68.rt9.123.el8.nocswd.x86_64 > root=/dev/mapper/rhel_vm--74--52-root ro console=tty0 console=ttyS0,115200n8 > biosdevname=0 crashkernel=auto resume=/dev/mapper/rhel_vm--74--52-swap > rd.lvm.lv=rhel_vm-74-52/root rd.lvm.lv=rhel_vm-74-52/swap skew_tick=1 > isolcpus=1 intel_pstate=disable nosoftlockup nohz=on nohz_full=1 rcu_nocbs=1 > nocswd default_hugepagesz=1G iommu=pt intel_iommu=on > > 1 hour max latency is: > > # Min Latencies: 00008 > # Avg Latencies: 00015 > # Max Latencies: 00049 Mmm. Would you be able to trace both guest and host while idle with something like the following? trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && trace-cmd stop && trace-cmd extract trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l Just to check that clocksource watchdog is in fact disabled. > > > > Other infos for both 2 versions: > > 1. Versions besides kernel-rt: > > tuned-2.10.0-14.el8.noarch > python3-linux-procfs-0.6-6.el8.noarch > > 2. For both host and guest: > > # cat /proc/sys/kernel/timer_migration > 0 > > Juri, could you please share how to check the FIFO value of ktimersoftd(s) > and ksoftirqd(s)? Thanks. Sure. for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done Thanks. Also, what is the value for the following (on the host) you run with? /sys/module/kvm/parameters/lapic_timer_advance_ns I saw that it has quite an impact on latencies. (Anything below 3000 doesn't really work for me). (In reply to Juri Lelli from comment #33) > > for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done pid 10's current scheduling policy: SCHED_FIFO pid 10's current scheduling priority: 2 ... > for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done pid 11's current scheduling policy: SCHED_FIFO pid 11's current scheduling priority: 3 ... > Thanks. (In reply to Juri Lelli from comment #34) > Also, what is the value for the following (on the host) you run with? > > /sys/module/kvm/parameters/lapic_timer_advance_ns > > I saw that it has quite an impact on latencies. (Anything below 3000 > doesn't really work for me). 
# cat /sys/module/kvm/parameters/lapic_timer_advance_ns 0 (In reply to Pei Zhang from comment #35) > (In reply to Juri Lelli from comment #33) > > > > for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done > > pid 10's current scheduling policy: SCHED_FIFO > pid 10's current scheduling priority: 2 > ... > > > for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done > > pid 11's current scheduling policy: SCHED_FIFO > pid 11's current scheduling priority: 3 > ... > They look OK. Thanks for checking. > (In reply to Juri Lelli from comment #34) > > Also, what is the value for the following (on the host) you run with? > > > > /sys/module/kvm/parameters/lapic_timer_advance_ns > > > > I saw that it has quite an impact on latencies. (Anything below 3000 > > doesn't really work for me). > > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns > 0 This instead I think needs tuning. I'd start with 1000, but, as said, in my test box I couldn't get below 40us latencies until I put 3000 in this parameter (I believe tuning guide advise this tuning strategy as well). Could you please try again with 1000 and increments of 500 until we hopefully see acceptable latencies? I'd anyway stop at 5000. Thanks! (In reply to Juri Lelli from comment #33) [...] > > Mmm. Would you be able to trace both guest and host while idle with > something like the following? > > trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e > 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && > trace-cmd stop && trace-cmd extract > > trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l > > Just to check that clocksource watchdog is in fact disabled. > # trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && trace-cmd stop && trace-cmd extract CPU0 data recorded at offset=0x5b5000 3477504 bytes in size CPU1 data recorded at offset=0x906000 0 bytes in size # trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l 0 Is this expected? (In reply to Juri Lelli from comment #36) > (In reply to Pei Zhang from comment #35) > > (In reply to Juri Lelli from comment #33) > > > > > > for PID in $(pgrep ksoftirqd); do chrt -p ${PID}; done > > > > pid 10's current scheduling policy: SCHED_FIFO > > pid 10's current scheduling priority: 2 > > ... > > > > > for PID in $(pgrep ktimersoftd); do chrt -p ${PID}; done > > > > pid 11's current scheduling policy: SCHED_FIFO > > pid 11's current scheduling priority: 3 > > ... > > > > They look OK. Thanks for checking. > > > (In reply to Juri Lelli from comment #34) > > > Also, what is the value for the following (on the host) you run with? > > > > > > /sys/module/kvm/parameters/lapic_timer_advance_ns > > > > > > I saw that it has quite an impact on latencies. (Anything below 3000 > > > doesn't really work for me). > > > > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns > > 0 > > This instead I think needs tuning. I'd start with 1000, but, as said, > in my test box I couldn't get below 40us latencies until I put 3000 in > this parameter (I believe tuning guide advise this tuning strategy as > well). > > Could you please try again with 1000 and increments of 500 until we > hopefully see acceptable latencies? I'd anyway stop at 5000. OK. I'll have a try. Best regards, Pei (In reply to Pei Zhang from comment #37) > (In reply to Juri Lelli from comment #33) > [...] > > > > Mmm. 
Would you be able to trace both guest and host while idle with > > something like the following? > > > > trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e > > 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && > > trace-cmd stop && trace-cmd extract > > > > trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l > > > > Just to check that clocksource watchdog is in fact disabled. > > > > # trace-cmd start -e 'sched:sched_wakeup' -e 'sched:sched_migrate_task' -e > 'sched:sched_switch' -e 'timer:*' -e 'irq:*' -b 10000 && sleep 5 && > trace-cmd stop && trace-cmd extract > CPU0 data recorded at offset=0x5b5000 > 3477504 bytes in size > CPU1 data recorded at offset=0x906000 > 0 bytes in size > > # trace-cmd report 2>/dev/null | grep clocksource_watchdog | wc -l > 0 > > Is this expected? If kernel booted with "nocswd" yes. Getting something different from 0 would mean that some watchdog has been running. (In reply to Juri Lelli from comment #36) > > (In reply to Juri Lelli from comment #34) > > > Also, what is the value for the following (on the host) you run with? > > > > > > /sys/module/kvm/parameters/lapic_timer_advance_ns > > > > > > I saw that it has quite an impact on latencies. (Anything below 3000 > > > doesn't really work for me). > > > > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns > > 0 > > This instead I think needs tuning. I'd start with 1000, but, as said, > in my test box I couldn't get below 40us latencies until I put 3000 in > this parameter (I believe tuning guide advise this tuning strategy as > well). > > Could you please try again with 1000 and increments of 500 until we > hopefully see acceptable latencies? I'd anyway stop at 5000. Juri, I think you just found another important issue! This manual work you're (rightfully) doing, is supposed to be automated by the tuned host profile and hence lapic_timer_advance_ns is supposed to be non-zero. If that's not the case, then that's probably a bug. Would you please open a BZ? The component should be tuned, and the affected profile is the realtime-virtual-host one. You can assign it to me. (In reply to Luiz Capitulino from comment #40) > (In reply to Juri Lelli from comment #36) > > > > (In reply to Juri Lelli from comment #34) > > > > Also, what is the value for the following (on the host) you run with? > > > > > > > > /sys/module/kvm/parameters/lapic_timer_advance_ns > > > > > > > > I saw that it has quite an impact on latencies. (Anything below 3000 > > > > doesn't really work for me). > > > > > > # cat /sys/module/kvm/parameters/lapic_timer_advance_ns > > > 0 > > > > This instead I think needs tuning. I'd start with 1000, but, as said, > > in my test box I couldn't get below 40us latencies until I put 3000 in > > this parameter (I believe tuning guide advise this tuning strategy as > > well). > > > > Could you please try again with 1000 and increments of 500 until we > > hopefully see acceptable latencies? I'd anyway stop at 5000. > > Juri, > > I think you just found another important issue! This manual work you're > (rightfully) doing, is supposed to be automated by the tuned host profile > and hence lapic_timer_advance_ns is supposed to be non-zero. If that's > not the case, then that's probably a bug. Would you please open a BZ? The > component should be tuned, and the affected profile is the > realtime-virtual-host > one. You can assign it to me. Luiz, I've filed this bz[1], already cc you and Juri. 
[1]Bug 1680199 - default value of lapic_timer_advance_ns is set 0, but we expect non-zero value No lapic_timer_advance_ns version max_latency 1 0 4.18.0-71.rt9.127.el8 63us 2 0 4.18.0-68.rt9.123.el8.nocswd 67us 3 1000 4.18.0-71.rt9.127.el8 46us 4 1000 4.18.0-68.rt9.123.el8.nocswd 47us 5 2000 4.18.0-71.rt9.127.el8 47us 6 2000 4.18.0-68.rt9.123.el8.nocswd 38us Beak links: 1.https://beaker.engineering.redhat.com/jobs/3361245 2.https://beaker.engineering.redhat.com/jobs/3361244 3.https://beaker.engineering.redhat.com/jobs/3362923 (manaully run the 3rd scenario in this run) 4.https://beaker.engineering.redhat.com/jobs/3362924 5.https://beaker.engineering.redhat.com/jobs/3363722 6.https://beaker.engineering.redhat.com/jobs/3363720 In Comment 42, I tested all 3 kvm-rt standard scenarios and each running 1 hour cyclictest. Thanks for doing all the runs! Even though run 6. is meeting latency requirements already, given the trend, I wonder if it would make sense to further test with 2500, 3000 (at least) and see if latency keeps getting smaller. Luiz? Yeah, I think it's good to have the numbers. But I don't think we can avoid disabling the clocksource watchdog: if it's not causing a spike, then the timer is probably starving on the cores where there's vcpu threads pinned running the test-case. Also, Pei, did you run the nocswd kernel in the guest? If not, you should since the clocksource watchdog will run there too. (In reply to Luiz Capitulino from comment #45) > Yeah, I think it's good to have the numbers. But I don't think we can avoid > disabling the clocksource watchdog: if it's not causing a spike, then the > timer is probably starving on the cores where there's vcpu threads pinned > running the test-case. Mmm, why so? ktimersoftd(s) run it, and they should have higher priority than the vcpus threads on the host. That said, I'm not in principle against disabling the watchdog. Just making sure we agree on why we are disabling it and potentially collecting the best info we can collect to report upstream. > Also, Pei, did you run the nocswd kernel in the guest? If not, you should > since the clocksource watchdog will run there too. Agree. Quite a lot of configurations of configurations to test (sorry), but I believe it's good info at the end. Thanks! (In reply to Juri Lelli from comment #46) > (In reply to Luiz Capitulino from comment #45) > > Yeah, I think it's good to have the numbers. But I don't think we can avoid > > disabling the clocksource watchdog: if it's not causing a spike, then the > > timer is probably starving on the cores where there's vcpu threads pinned > > running the test-case. > > Mmm, why so? ktimersoftd(s) run it, and they should have higher priority than > the vcpus threads on the host. Oh, you're right. For some reason I thought they had the same priority, but you're right that ktimersoftd is higher. > That said, I'm not in principle against disabling the watchdog. Just making > sure we agree on why we are disabling it and potentially collecting the best > info we can collect to report upstream. So, I'm a bit surprised that the clocksource watchdog is not causing a higher latency in cyclictest. In any case, it's causing interruptions which is also bad for DPDK. There's a tool we use sometimes to collect interruptions called sysjitter, but you can probably do this with tracing. > > > Also, Pei, did you run the nocswd kernel in the guest? If not, you should > > since the clocksource watchdog will run there too. > > Agree. 
Quite a lot of configurations of configurations to test (sorry), but > I believe it's good info at the end. Yeah, this all looks good to me. (In reply to Juri Lelli from comment #44) > Thanks for doing all the runs! > > Even though run 6. is meeting latency requirements already, > given the trend, I wonder if it would make sense to further > test with 2500, 3000 (at least) and see if latency keeps > getting smaller. I run with lapic_timer_advance_ns=3000, each version running 2 times. No lapic_timer_advance_ns version max_latency 1 3000 4.18.0-73.rt9.129.el8 48us 2 3000 4.18.0-73.rt9.129.el8 48us 3 3000 4.18.0-68.rt9.123.el8.nocswd 31us 2 3000 4.18.0-68.rt9.123.el8.nocswd 37us Beak links: 1.https://beaker.engineering.redhat.com/jobs/3369048 2.https://beaker.engineering.redhat.com/jobs/3366432 3.https://beaker.engineering.redhat.com/jobs/3369047 4.https://beaker.engineering.redhat.com/jobs/3366436 (In reply to Luiz Capitulino from comment #45) > Yeah, I think it's good to have the numbers. But I don't think we can avoid > disabling the clocksource watchdog: if it's not causing a spike, then the > timer is probably starving on the cores where there's vcpu threads pinned > running the test-case. > > Also, Pei, did you run the nocswd kernel in the guest? If not, you should > since the clocksource watchdog will run there too. Yes, I'm running nocswd kernel in both host and guest. In our testings, I always run same versions on host and guest. I just discovered that clocksource watchdog doesn't run if "tsc=reliable" is added to the kernel cmdline. This should allow us to fix the issue at hand w/o additional kernel side patches. Worth a try? (In reply to Juri Lelli from comment #50) > I just discovered that clocksource watchdog doesn't run if > "tsc=reliable" is added to the kernel cmdline. > > This should allow us to fix the issue at hand w/o additional > kernel side patches. > > Worth a try? Hi Juri, Do you mean testing normal kernel-rt with adding tsc=reliable in both host and guest kernel line? Also add lapic_timer_advance_ns=3000,right? I can submit a 24 hours run with all 3 testing scenarios this weekend. Thanks. (In reply to Pei Zhang from comment #51) > (In reply to Juri Lelli from comment #50) > > I just discovered that clocksource watchdog doesn't run if > > "tsc=reliable" is added to the kernel cmdline. > > > > This should allow us to fix the issue at hand w/o additional > > kernel side patches. > > > > Worth a try? > > Hi Juri, > > Do you mean testing normal kernel-rt with adding tsc=reliable in both host > and guest kernel line? Yes, add "tsc=reliable" to the command line you already use for running tests. Both host and guest. And use latest (unpatched) kernel-rt. > > Also add lapic_timer_advance_ns=3000, right? Yes, that or best value, if tuned gets fixed. For now, 3000 should be anyway OK. > I can submit a 24 hours run with all 3 testing scenarios this weekend. That would be cool, thanks. (In reply to Juri Lelli from comment #52) > (In reply to Pei Zhang from comment #51) > > (In reply to Juri Lelli from comment #50) > > > I just discovered that clocksource watchdog doesn't run if > > > "tsc=reliable" is added to the kernel cmdline. > > > > > > This should allow us to fix the issue at hand w/o additional > > > kernel side patches. > > > > > > Worth a try? > > > > Hi Juri, > > > > Do you mean testing normal kernel-rt with adding tsc=reliable in both host > > and guest kernel line? > > Yes, add "tsc=reliable" to the command line you already use for > running tests. 
Both host and guest. And use latest (unpatched) > kernel-rt. > > > > > Also add lapic_timer_advance_ns=3000, right? > > Yes, that or best value, if tuned gets fixed. > > For now, 3000 should be anyway OK. > > > I can submit a 24 hours run with all 3 testing scenarios this weekend. > > That would be cool, thanks. Well noted. I'll test following these requirements. Thanks. Oh, actually I would keep kernel-rt version at 4.18.0-71.rt9.127.el8 I believe there are changes in newer kernels that might introduce variability that we don't want at this stage. Thanks! (In reply to Juri Lelli from comment #54) > Oh, actually I would keep kernel-rt version at 4.18.0-71.rt9.127.el8 > > I believe there are changes in newer kernels that might introduce > variability that we don't want at this stage. Ok. Get it. I'm submitting both 4.18.0-71.rt9.127.el8 and 4.18.0-74.rt9.130.el8(latest one) separately on 2 servers. > Thanks! (In reply to Juri Lelli from comment #50) > I just discovered that clocksource watchdog doesn't run if > "tsc=reliable" is added to the kernel cmdline. > > This should allow us to fix the issue at hand w/o additional > kernel side patches. > > Worth a try? Honest question: does tsc=reliable disable all tsc checks? If yes, what happens if we pass tsc=reliable and the tsc is not reliable? I guess we wouldn't be warned, right? I'm not sure we can afford that. Marcelo, what you think about this? Btw, I think we'd have to hardcode tsc=reliable in the tuned profiles... (In reply to Luiz Capitulino from comment #56) > (In reply to Juri Lelli from comment #50) > > I just discovered that clocksource watchdog doesn't run if > > "tsc=reliable" is added to the kernel cmdline. > > > > This should allow us to fix the issue at hand w/o additional > > kernel side patches. > > > > Worth a try? > > Honest question: does tsc=reliable disable all tsc checks? Not sure if it does something else then disabling the clocksource watchdog. > If yes, what happens if we pass tsc=reliable and the tsc is > not reliable? I guess we wouldn't be warned, right? I'm not sure > we can afford that. But, then won't we be in the same situation as with disabling clocksource watchdog only? That is the thing, AFAIK, that checks for TSC reliability at runtime. > > Marcelo, what you think about this? > > Btw, I think we'd have to hardcode tsc=reliable in the tuned > profiles... (In reply to Juri Lelli from comment #57) > (In reply to Luiz Capitulino from comment #56) > > (In reply to Juri Lelli from comment #50) > > > I just discovered that clocksource watchdog doesn't run if > > > "tsc=reliable" is added to the kernel cmdline. > > > > > > This should allow us to fix the issue at hand w/o additional > > > kernel side patches. > > > > > > Worth a try? > > > > Honest question: does tsc=reliable disable all tsc checks? > > Not sure if it does something else then disabling the clocksource > watchdog. > > > If yes, what happens if we pass tsc=reliable and the tsc is > > not reliable? I guess we wouldn't be warned, right? I'm not sure > > we can afford that. > > But, then won't we be in the same situation as with disabling > clocksource watchdog only? That is the thing, AFAIK, that checks > for TSC reliability at runtime. I think there are reliability checks at boot too. > > > > > Marcelo, what you think about this? > > > > Btw, I think we'd have to hardcode tsc=reliable in the tuned > > profiles... 
(In reply to Luiz Capitulino from comment #58) > (In reply to Juri Lelli from comment #57) > > (In reply to Luiz Capitulino from comment #56) > > > (In reply to Juri Lelli from comment #50) > > > > I just discovered that clocksource watchdog doesn't run if > > > > "tsc=reliable" is added to the kernel cmdline. > > > > > > > > This should allow us to fix the issue at hand w/o additional > > > > kernel side patches. > > > > > > > > Worth a try? > > > > > > Honest question: does tsc=reliable disable all tsc checks? > > > > Not sure if it does something else then disabling the clocksource > > watchdog. > > > > > If yes, what happens if we pass tsc=reliable and the tsc is > > > not reliable? I guess we wouldn't be warned, right? I'm not sure > > > we can afford that. > > > > But, then won't we be in the same situation as with disabling > > clocksource watchdog only? That is the thing, AFAIK, that checks > > for TSC reliability at runtime. > > I think there are reliability checks at boot too. Mmm, but even in this case, if tsc becomes unreliable after boot and we don't detect that, bad things would happen as well, no? So, it seems to me that on systems where we can afford to disable the watchdog, we better be sure that TSC is stable indeed. > > > > > > > > > Marcelo, what you think about this? > > > > > > Btw, I think we'd have to hardcode tsc=reliable in the tuned > > > profiles... (In reply to Juri Lelli from comment #59) > (In reply to Luiz Capitulino from comment #58) > > (In reply to Juri Lelli from comment #57) > > > (In reply to Luiz Capitulino from comment #56) > > > > (In reply to Juri Lelli from comment #50) > > > > > I just discovered that clocksource watchdog doesn't run if > > > > > "tsc=reliable" is added to the kernel cmdline. > > > > > > > > > > This should allow us to fix the issue at hand w/o additional > > > > > kernel side patches. > > > > > > > > > > Worth a try? > > > > > > > > Honest question: does tsc=reliable disable all tsc checks? > > > > > > Not sure if it does something else then disabling the clocksource > > > watchdog. > > > > > > > If yes, what happens if we pass tsc=reliable and the tsc is > > > > not reliable? I guess we wouldn't be warned, right? I'm not sure > > > > we can afford that. > > > > > > But, then won't we be in the same situation as with disabling > > > clocksource watchdog only? That is the thing, AFAIK, that checks > > > for TSC reliability at runtime. > > > > I think there are reliability checks at boot too. > > Mmm, but even in this case, if tsc becomes unreliable after boot and > we don't detect that, bad things would happen as well, no? Yes, but not having a runtime check has been the case for several years. Now, the question that I have is: is the boot check not enough? Or even unreliable so that the runtime check is in fact, necessary? If that's the case, then I agree disabling the clocksource watchdog will have the same effect as passing tsc=reliable. > So, it seems to me that on systems where we can afford to disable the > watchdog, we better be sure that TSC is stable indeed. KVM-RT does require a stable tsc. I think what I'm asking is, what are the implications of having tsc=reliable hardcoded in the tuned profiles? (In reply to Luiz Capitulino from comment #56) > (In reply to Juri Lelli from comment #50) > > I just discovered that clocksource watchdog doesn't run if > > "tsc=reliable" is added to the kernel cmdline. > > > > This should allow us to fix the issue at hand w/o additional > > kernel side patches. > > > > Worth a try? 
> > Honest question: does tsc=reliable disable all tsc checks? > If yes, what happens if we pass tsc=reliable and the tsc is > not reliable? I guess we wouldn't be warned, right? I'm not sure > we can afford that. > > Marcelo, what you think about this? > > Btw, I think we'd have to hardcode tsc=reliable in the tuned > profiles... First of all, i don't know what are all the consequences of tsc=reliable. But for the clocksource watchdog, what is necessary is to disable clocksource watchdog on a per-CPU basis, with justification that it adds X us to processing latency and that is not acceptable to the workload being handled (upstream patch). Which was not properly measured AFAIK. From what i understand from this thread, we are not certain how much latency TSC watchdog adds. About TSC going unstable, on comment 13 Luiz made a hypothesis: "In the host, since our RT threads run at 100%, the clocksource watchdog seems to hang between clock reads, which causes it to think there's a skew, which in turn cause it to mark the TSC as unstable." But later on the discussion it was noted that ktimersoftd priority is higher. So is the actual reason TSC is being marked unstable? (In reply to Luiz Capitulino from comment #60) > (In reply to Juri Lelli from comment #59) > > (In reply to Luiz Capitulino from comment #58) > > > (In reply to Juri Lelli from comment #57) > > > > (In reply to Luiz Capitulino from comment #56) > > > > > (In reply to Juri Lelli from comment #50) > > > > > > I just discovered that clocksource watchdog doesn't run if > > > > > > "tsc=reliable" is added to the kernel cmdline. > > > > > > > > > > > > This should allow us to fix the issue at hand w/o additional > > > > > > kernel side patches. > > > > > > > > > > > > Worth a try? > > > > > > > > > > Honest question: does tsc=reliable disable all tsc checks? > > > > > > > > Not sure if it does something else then disabling the clocksource > > > > watchdog. > > > > > > > > > If yes, what happens if we pass tsc=reliable and the tsc is > > > > > not reliable? I guess we wouldn't be warned, right? I'm not sure > > > > > we can afford that. > > > > > > > > But, then won't we be in the same situation as with disabling > > > > clocksource watchdog only? That is the thing, AFAIK, that checks > > > > for TSC reliability at runtime. > > > > > > I think there are reliability checks at boot too. > > > > Mmm, but even in this case, if tsc becomes unreliable after boot and > > we don't detect that, bad things would happen as well, no? > > Yes, but not having a runtime check has been the case for several years. > Now, the question that I have is: is the boot check not enough? Or > even unreliable so that the runtime check is in fact, necessary? > > If that's the case, then I agree disabling the clocksource watchdog > will have the same effect as passing tsc=reliable. > > > So, it seems to me that on systems where we can afford to disable the > > watchdog, we better be sure that TSC is stable indeed. > > KVM-RT does require a stable tsc. I think what I'm asking is, what are > the implications of having tsc=reliable hardcoded in the tuned profiles? You don't check whether TSC is unstable. So you rely on properly performing hardware to guarantee TSC stability (which is a bad idea). IMO should find out what is the reason for TSC being marked unstable. 
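As a side note on the tsc=reliable discussion, the clocksource the kernel actually ends up using can be inspected at runtime through the standard sysfs interface (these paths are generic kernel interfaces, not something introduced in this thread):

```sh
# Clocksource currently in use (tsc, or hpet after the watchdog demotes the TSC)
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

# All clocksources the kernel considers usable on this boot
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

# Confirm whether tsc=reliable was actually passed on the command line
grep -o 'tsc=[^ ]*' /proc/cmdline
```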
(In reply to Marcelo Tosatti from comment #62)
> (In reply to Luiz Capitulino from comment #60)
> > (In reply to Juri Lelli from comment #59)
> > > (In reply to Luiz Capitulino from comment #58)
> > > > (In reply to Juri Lelli from comment #57)
> > > > > (In reply to Luiz Capitulino from comment #56)
> > > > > > (In reply to Juri Lelli from comment #50)
> > > > > > > I just discovered that clocksource watchdog doesn't run if
> > > > > > > "tsc=reliable" is added to the kernel cmdline.
> > > > > > >
> > > > > > > This should allow us to fix the issue at hand w/o additional
> > > > > > > kernel side patches.
> > > > > > >
> > > > > > > Worth a try?
> > > > > >
> > > > > > Honest question: does tsc=reliable disable all tsc checks?
> > > > >
> > > > > Not sure if it does something else then disabling the clocksource
> > > > > watchdog.
> > > > >
> > > > > > If yes, what happens if we pass tsc=reliable and the tsc is
> > > > > > not reliable? I guess we wouldn't be warned, right? I'm not sure
> > > > > > we can afford that.
> > > > >
> > > > > But, then won't we be in the same situation as with disabling
> > > > > clocksource watchdog only? That is the thing, AFAIK, that checks
> > > > > for TSC reliability at runtime.
> > > >
> > > > I think there are reliability checks at boot too.
> > >
> > > Mmm, but even in this case, if tsc becomes unreliable after boot and
> > > we don't detect that, bad things would happen as well, no?
> >
> > Yes, but not having a runtime check has been the case for several years.
> > Now, the question that I have is: is the boot check not enough? Or
> > even unreliable so that the runtime check is in fact necessary?
> >
> > If that's the case, then I agree disabling the clocksource watchdog
> > will have the same effect as passing tsc=reliable.
> >
> > > So, it seems to me that on systems where we can afford to disable the
> > > watchdog, we better be sure that TSC is stable indeed.
> >
> > KVM-RT does require a stable tsc. I think what I'm asking is, what are
> > the implications of having tsc=reliable hardcoded in the tuned profiles?
>
> You don't check whether TSC is unstable. So you rely on properly performing
> hardware to guarantee TSC stability (which is a bad idea).
>
> IMO should find out what is the reason for TSC being marked unstable.

This was happening because tuned wasn't properly setting up ktimersoftd(s)
priorities after reboot. And this issue has been fixed.

(In reply to Marcelo Tosatti from comment #61)
> About TSC going unstable, on comment 13 Luiz made a hypothesis:
>
> "In the host, since our RT threads run at 100%, the clocksource watchdog
> seems to hang between clock reads, which causes it to think there's a
> skew, which in turn cause it to mark the TSC as unstable."
>
> But later on the discussion it was noted that ktimersoftd priority is higher.
>
> So is the actual reason TSC is being marked unstable?

Good point. Juri found out that due to a bug in tuned the ktimersoftd thread
had a lower priority than the vcpu thread, so ktimersoftd was getting stuck.
That's why there was a delta, and that's why the tsc was being marked as
unstable. Now that bug is fixed, we don't see the tsc being marked as
unstable anymore.

However, having the timer firing every few seconds does increase the
latency. That's the problem we want to fix next. What I'm hearing is
that we have two options:

1. Use tsc=reliable in the profiles. To do that, we have to understand
all the implications of having tsc=reliable (ie. what the kernel does
or doesn't do)

2. Make clocksource watchdog respect a CPU mask provided by the user

What do you guys think we should do? My only concern about item 2 is to
get stuck on upstream for too long. Maybe we could use tsc=reliable
as a workaround meanwhile?
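If tsc=reliable does get tried as an interim workaround before any tuned change lands, adding it by hand is simple. A sketch, assuming grubby manages the boot entries as usual on RHEL 8 (and, per the test notes further down, the same flag also goes on the guest kernel command line):

# append tsc=reliable to every installed kernel's command line, then reboot
grubby --update-kernel=ALL --args="tsc=reliable"

# after the reboot, confirm the flag is active and that the watchdog stays quiet
grep -o 'tsc=reliable' /proc/cmdline
dmesg | grep -iE 'tsc.*unstable|switched to clocksource'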
(In reply to Luiz Capitulino from comment #64)
> (In reply to Marcelo Tosatti from comment #61)
> > About TSC going unstable, on comment 13 Luiz made a hypothesis:
> >
> > "In the host, since our RT threads run at 100%, the clocksource watchdog
> > seems to hang between clock reads, which causes it to think there's a
> > skew, which in turn cause it to mark the TSC as unstable."
> >
> > But later on the discussion it was noted that ktimersoftd priority is higher.
> >
> > So is the actual reason TSC is being marked unstable?
>
> Good point. Juri found out that due to a bug in tuned the ktimersoftd thread
> had a lower priority than the vcpu thread, so ktimersoftd was getting stuck.
> That's why there was a delta, and that's why the tsc was being marked as
> unstable. Now that bug is fixed, we don't see the tsc being marked as
> unstable anymore.
>
> However, having the timer firing every few seconds does increase the
> latency. That's the problem we want to fix next. What I'm hearing is
> that we have two options:
>
> 1. Use tsc=reliable in the profiles. To do that, we have to understand
> all the implications of having tsc=reliable (ie. what the kernel does
> or doesn't do)

I'll look into this.

> 2. Make clocksource watchdog respect a CPU mask provided by the user

And I already started asking upstream whether this would be a sensible and
viable option. The first reply, however, was that this might not work, as
checking for stability is an all-or-nothing thing. I will try to get a
definitive answer from the experts (Thomas, for example).

Please mention the latency numbers you get with and without the cs watchdog
on the isolated CPUs... and then mention that NFV wants < 20us.

> What do you guys think we should do? My only concern about item 2 is to
> get stuck on upstream for too long. Maybe we could use tsc=reliable
> as a workaround meanwhile?

Testing with 24-hour cyclictest runs, we have finished 2 testing scenarios
on 2 standard kernel-rt versions so far.

Note:
(1) Adding tsc=reliable to both the host and guest kernel command line
(2) # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
    3000

== With kernel-rt-4.18.0-74.rt9.130.el8.x86_64, max latency is 52us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374286

(1) Single VM with 1 rt vCPU
# Min Latencies: 00006
# Avg Latencies: 00013
# Max Latencies: 00049

(3) Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00033
- VM2
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00052
- VM3
# Min Latencies: 00006
# Avg Latencies: 00012
# Max Latencies: 00030
- VM4
# Min Latencies: 00005
# Avg Latencies: 00012
# Max Latencies: 00031

== With kernel-rt-4.18.0-71.rt9.127.el8.x86_64, max latency is 44us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374285

(1) Single VM with 1 rt vCPU
# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00031

(3) Multiple VMs each with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00042

# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00029

# Min Latencies: 00006
# Avg Latencies: 00012
# Max Latencies: 00044

# Min Latencies: 00006
# Avg Latencies: 00011
# Max Latencies: 00029

Update for the third testing scenario of Comment 67:

Note:
(1) Adding tsc=reliable to both the host and guest kernel command line
(2) # cat /sys/module/kvm/parameters/lapic_timer_advance_ns
    3000

== With kernel-rt-4.18.0-74.rt9.130.el8.x86_64, max latency is 84us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374286

(3) Single VM with 8 rt vCPUs
# Min Latencies: 00005 00014 00014 00014 00014 00014 00014 00011
# Avg Latencies: 00013 00015 00015 00015 00014 00015 00015 00018
# Max Latencies: 00031 00021 00022 00022 00021 00021 00021 00084

== With kernel-rt-4.18.0-71.rt9.127.el8.x86_64, max latency is 49us.
Beaker job: https://beaker.engineering.redhat.com/jobs/3374285

(3) Single VM with 8 rt vCPUs
# Min Latencies: 00006 00013 00013 00013 00013 00013 00013 00011
# Avg Latencies: 00011 00014 00014 00014 00014 00014 00014 00014
# Max Latencies: 00049 00042 00042 00042 00043 00042 00042 00043
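(For context, the "# Min/Avg/Max Latencies" trailers above are what cyclictest prints at the end of a histogram run. A single-vCPU measurement of this kind would look roughly like the following inside the guest; this is only a sketch, since the exact options and the vCPU to pin to come from the standard KVM-RT test plan rather than from this BZ, and CPU1 is just assumed to be the isolated rt vCPU:)

# 24-hour run, one SCHED_FIFO prio 95 measurement thread pinned to the isolated vCPU,
# memory locked (-m), clock_nanosleep intervals (-n), quiet histogram output (-q -h)
cyclictest -m -n -q -p95 -a 1 -t 1 -h 60 -D 24h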
Update on the latest testing: max cyclictest latency is 47us.

Note: No extra options were set; we just followed the standard testing
steps/configurations in this run.

Versions:
libvirt-5.0.0-7.module+el8+2887+effa3c42.x86_64
tuned-2.10.0-15.el8.noarch
kernel-rt-4.18.0-80.rt9.138.el8.x86_64
qemu-kvm-3.1.0-20.module+el8+2888+cdc893a8.x86_64

(1) Single VM with 1 rt vCPU:
# Min Latencies: 00006
# Avg Latencies: 00017
# Max Latencies: 00040

(2) Single VM with 8 rt vCPUs:
# Min Latencies: 00007 00015 00019 00019 00018 00019 00014 00012
# Avg Latencies: 00016 00019 00019 00019 00019 00019 00019 00019
# Max Latencies: 00038 00028 00027 00027 00028 00031 00027 00044

(3) Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00036
- VM2
# Min Latencies: 00007
# Avg Latencies: 00015
# Max Latencies: 00037
- VM3
# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 00047
- VM4
# Min Latencies: 00007
# Avg Latencies: 00016
# Max Latencies: 00038

Closing this, as the root problem for the huge (~200us) spike (and the
related dmesg warnings) was fixed by
https://bugzilla.redhat.com/show_bug.cgi?id=1669294 (userspace component).

Going to open a new BZ to track the remaining issue of the clocksource
watchdog timer causing latencies >40us.