Bug 1447938
| Summary: | Boot single VM with 8 rt vCPUs, the latency value is high | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Pei Zhang <pezhang> |
| Component: | tuned | Assignee: | Jaroslav Škarvada <jskarvad> |
| Status: | CLOSED ERRATA | QA Contact: | Tereza Cerna <tcerna> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.4 | CC: | atragler, bhu, chayang, hhuang, jeder, jmario, jskarvad, juzhang, knoel, lcapitulino, michen, mtessun, pezhang, sgordon, tcerna, thozza, virt-maint, williams |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tuned-2.8.0-4.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-08-01 12:35:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1240765 | | |
Description (Pei Zhang, 2017-05-04 09:28:13 UTC)
Close to 1ms! I have some questions:

- You seem to have lots of pCPUs. Are you sure you don't have Hyper-Threading enabled in the BIOS?
- Did the same test case (same number of vCPUs, same duration, etc.) work before kernel -581 without a spike?
- Does this need 12 hours to reproduce?
- Have you tried to reproduce with fewer than 8 RT vCPUs?

I'll try to reproduce this today.

While trying to reproduce this issue I was able to get max=30us in one of the vCPUs. When debugging it I found that rteval was pinning load processes to isolated CPUs (this is bug 1448201).

It's hard to tell whether or not that rteval issue has anything to do with this spike. We'll have to wait for a fixed rteval and try to reproduce again.

It's also a bit unexpected that, even with that rteval issue, cyclictest gives good results most of the time.

(In reply to Luiz Capitulino from comment #6)
> While trying to reproduce this issue I was able to get max=30us in one of
> the vCPUs. When debugging it I found that rteval was pinning load processes
> to isolated CPUs (this is bug 1448201).

Luiz, it seems you have reproduced the problem and found the reason. I'm not sure whether my reproduction is still needed to confirm that this issue is reproducible rather than occasional. Today I tried 1 hour, 2 hours and 3 hours with 8 RT vCPUs, and the latency values look good. Now I'm testing 12 hours again. If the reproduction is not needed anymore, please let me know.

Thanks,
Pei

> It's hard to tell whether or not that rteval issue has anything to do with
> this spike. We'll have to wait for a fixed rteval and try to reproduce again.
>
> It's also a bit unexpected that even with that rteval issue cyclictest gives
> good results most of the time.

Here's my 12h result without rteval (so without any load whatsoever):

```
# Min Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00011 00010 00010 00010 00010 00010 00010 00010
```

Yesterday I was able to get one of the vCPUs close to max=30 when running for 1 hour with rteval; we can't tell whether that was caused by the rteval pinning bug or not. I did another run where I ran a kernel build on the housekeeping CPUs instead of running rteval. After a few tries, I got this (1 hour run):

```
# Min Latencies: 00003 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00010 00010 00010 00021 00034 00010 00028 00010
```

So there's a problem besides rteval. I'll debug it.
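For reference, the "# Min/Avg/Max Latencies" summary lines above are the footer cyclictest prints when run with a histogram. The exact options and CPU numbers used in these tests are not recorded in this report, so the commands below are only a sketch under assumed values:

```sh
# Hypothetical guest-side run: 8 measurement threads pinned to the 8 RT
# vCPUs (assumed here to be CPUs 1-8), SCHED_FIFO priority 95, histogram
# enabled so the "# Min/Avg/Max Latencies" summary is printed, 1-hour run.
cyclictest -m -q -p95 -t8 -a1-8 -h60 -D1h > cyclictest.log

# Load on the host's housekeeping CPUs instead of rteval, e.g. a kernel
# build pinned away from the isolated cores (CPU numbers and path made up):
taskset -c 0,9-15 make -j8 -C /path/to/linux-src
```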
I think I might have found the problem. In my debugging I was seeing two things:

1. Increasing the CPU count makes the problem worse.
2. When tracing, I saw this:

```
guest-trace.dat: stress-8443 [002] 3369.175510046: (+542) function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8447 [008] 3369.175510050: (+4)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8448 [003] 3369.175510051: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8445 [009] 3369.175510052: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8449 [006] 3369.175510053: (+1)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8444 [007] 3369.175510053: (+0)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8446 [004] 3369.175510055: (+2)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
guest-trace.dat: stress-8442 [005] 3369.175510057: (+2)   function: tick_sched_handle.isra.18 <-- tick_sched_timer
```

In this trace we can see that tick_sched_handle() is being called at almost the same time on all CPUs. The two points above seem to indicate that we may be running into lock contention because the tick fires at the same time on all CPUs. This is a known problem and is avoided with skew_tick=1.

Indeed, I can't reproduce the problem when I boot the host and the guest with skew_tick=1. My reproducer consists of running the 1h test case a few times; I'd always get the results from comment 10 in the second or third run. With skew_tick=1, however, I'm not able to reproduce this for a whole day. Also, running the test case for 15 hours I got:

```
# Min Latencies: 00003 00005 00005 00005 00005 00005 00005 00005
# Avg Latencies: 00005 00005 00005 00005 00005 00005 00005 00005
# Max Latencies: 00010 00010 00010 00010 00010 00010 00010 00010
```

I never got a result this good before when running with 6 RT vCPUs; I'd always get one CPU hitting max=21us. Now, when running for one hour I did get max=17us only once. While this is higher than the other CPUs, we are still below max=20us (our threshold).

I did some additional tests with skew_tick=1 and it really seems to fix the issue for me. So I'm moving this to tuned, as it can be addressed in the RT profiles. I'm also clearing the Regression keyword, as this issue has always existed.

Jaroslav, I'm going to post a patch for this issue. I'm waiting for input from the RT team in order to know whether I should add this to the KVM-RT profiles only or to the main RT profile too.
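For anyone retracing the analysis above, the boot parameter and a similar function trace can be checked roughly as follows. The exact tracing setup used here is not recorded in the report, so the trace-cmd invocation is an assumption:

```sh
# Confirm the running kernel (host and guest) was booted with skew_tick=1.
grep -o 'skew_tick=[01]' /proc/cmdline

# Hypothetical capture of tick-handler activity with the function tracer,
# filtered to tick_sched_handle (the wildcard covers the .isra.* suffix):
trace-cmd record -p function -l 'tick_sched_handle*' -o guest-trace.dat sleep 10
trace-cmd report -i guest-trace.dat | head
```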
Does it need to be conditionalized on the number of CPUs or sockets, or does it belong everywhere? What about non-RT kernel scenarios?

(In reply to Jeremy Eder from comment #14)
> Does it need to be conditionalized on the number of cpus or sockets or does
> it belong everywhere?

I _guess_ it won't hurt small boxes, but as Clark had the same question, I'll look more into this.

> What about non-RT kernel scenarios?

I think anything that's very sensitive to jitter may benefit. One good example is zero-loss (I'll open a separate BZ for it). But the kernel docs mention that a drawback is higher power consumption (which doesn't matter much for KVM-RT and zero-loss).

I propose adding this to the network-latency profile then, which currently sits in the following inheritance chain:

- cpu-partitioning
- realtime
- and network-latency itself, of course

If we need it to be set only when cores/sockets are greater than X, then perhaps we should add that detection capability to the tuned infrastructure so the profiles can leverage it? Contingent on Luiz's investigation.

We will also need to clearly document this feature as a comment in the network-latency tuned profile, because I am sure we will get questions:

- What exactly does it do, in plain English?
- How does it relate to nohz_full?
- When should it be set and when not (e.g. if I use nohz_full, should I avoid skew_tick)?

Does this sound reasonable?

Yes, it does! Just a note that the profiles are not yet making use of kernel command-line inheritance, so I'll have to change that first, and this makes me nervous because a small mistake can cause a big breakage. On the other hand, adding the same option to all those profiles by hand is not an option.

I did some additional research on this: read more code, understood the history behind skew_tick, and talked a bit with skew_tick author Mike Galbraith. Two conclusions:

1. skew_tick=1 is a MUST HAVE for RT and KVM-RT.
2. The benefits on non-RT are not entirely clear to me, but according to Mike they do exist.

Here's the main problem: by default, all CPUs tick at exactly the same time. In the RT kernel, jiffies_lock is a raw_spinlock. All CPUs usually read jiffies from the tick handler, so they end up waiting for each other. skew_tick=1 solves this because it causes each CPU to tick at a different time, avoiding reading jiffies at the same moment. The only theoretical problem with skew_tick is that it may increase power consumption, but this is disputed.

Upstream and in the RHEL kernel, jiffies_lock is not a spinlock, it's a seqlock, so I'd expect the situation to be much less severe than on RT, although some contention is still possible. Also, it seems there are two additional sources of contention when all CPUs tick at the same time:

- RCU. See upstream commit 02482880092, which advises skew_tick=1 due to RCU lock contention. Not sure whether this affects RHEL7, though.
- hrtimer's cpu_base lock is a raw spinlock, so I guess this could run into contention too?

Now, I'm still trying to decide whether I add skew_tick=1:

1. Only to the RT and KVM-RT profiles for now; or
2. To the network-latency profile, reworking all the profiles that include it to use kernel command-line inheritance.

Item 2 is the Right Thing, but I'm afraid it's an intrusive change: I'll have to touch several profiles, and I don't know how much kernel command-line inheritance has been exercised so far, since it's a new feature. Opinions?

I've decided to add skew_tick=1 to the RT profiles first. I'll open a new BZ to discuss its inclusion in non-RT profiles.

Here's the BZ for adding skew_tick=1 to non-RT profiles: Bug 1451073 - tuned: use skew_tick=1 in all low-latency profiles
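Once the fixed package is installed, a quick sanity check along these lines should show the parameter being picked up. The profile name and file path below are assumptions (use whichever RT profile is active, e.g. realtime-virtual-host for KVM-RT):

```sh
# Check that the shipped RT profile now carries skew_tick=1 in its
# [bootloader] cmdline (path assumed for RHEL 7's tuned packaging).
grep -n 'skew_tick' /usr/lib/tuned/realtime/tuned.conf

# Re-apply the profile so tuned regenerates the bootloader entry, then
# reboot and confirm the parameter reached the kernel command line.
tuned-adm profile realtime-virtual-host
grep -o 'skew_tick=1' /proc/cmdline
```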
@pezhang Thank you for testing it, this helps me a lot. I'm adding qa_ack for this bug.

Hi Pei,

I see everything is ready for testing. Can you do some functional test?

Regards,
Tereza

(In reply to Tereza Cerna from comment #32)
> Hi Pei,
>
> I see everything is ready for testing. Can you do some functional test?
>
> Regards,
> Tereza

I'm doing latency testing now; it needs to run for 12 hours, so we can have the results by tomorrow. Actually, I tested with tuned-2.8.0-5.el7.noarch and the latest kernel-rt yesterday, but after running for about 10 hours the guest hung, hitting the existing bug [1]:

[1] Bug 1448770 - INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds

Now I'm trying the older kernel-rt version (3.10.0-663.rt56.581.el7.x86_64) with tuned-2.8.0-5.el7.noarch.

I need to confirm: what do you mean by 'functional test'? If you mean the functionality of tuned itself, please see Comment 31.

Regards,
Pei

Hi Pei,

I need you to test your component with the new version of tuned, to see whether it works well and whether the two work well together. Thank you so much for your effort.

Update: Latency testing results look good with tuned-2.8.0-5.el7.noarch.

Versions:
3.10.0-663.rt56.581.el7.x86_64
tuned-2.8.0-5.el7.noarch
libvirt-3.2.0-10.el7.x86_64
qemu-kvm-rhev-2.9.0-10.el7.x86_64

Latency results after running 12 hours:

```
# Min Latencies: 00004 00006 00006 00006 00005 00006 00006 00006
# Avg Latencies: 00006 00006 00006 00006 00006 00006 00006 00006
# Max Latencies: 00012 00011 00011 00011 00011 00011 00011 00011
```

Hi Pei. Thank you so much for your testing. Based on the testing in #35, the bug is switched to VERIFIED. Everything looks good; the patch was applied.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2102

tuned-2.9.0-0.1.rc1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-d9c6b990df

tuned-2.9.0-1.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2017-0e45ce4685

tuned-2.9.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-c30e9bd1ea