Bug 1717212
Summary: | KVM tracebacks causing significant latency to VM | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Yichen Wang <yicwang> | |
Component: | kernel-rt | Assignee: | Peter Xu <peterx> | |
kernel-rt sub component: | KVM | QA Contact: | Pei Zhang <pezhang> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | high | |||
Priority: | high | CC: | bhu, chayang, chhudson, dhoward, jraju, juzhang, kabbott, lcapitulino, lgoncalv, mtosatti, pbonzini, peterx, pezhang, qzhang, virt-maint | |
Version: | 7.7 | Keywords: | ZStream | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-rt-3.10.0-1057.rt56.1017.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1719525 1728176 (view as bug list) | Environment: | ||
Last Closed: | 2019-08-06 12:36:32 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1719525 | |||
Bug Blocks: | 1655694, 1728176 |
Description
Yichen Wang
2019-06-04 22:31:52 UTC
Hi Yichen, I think the spike is caused because of the warning (since we it call printk(), etc). So, our only issue may very well be the warning itself. Pei, would be nice to gather more info and try to reproduce in our lab. There are more information asked by the engineer assigned to that case: https://access.redhat.com/support/cases/#/case/02397696. dmesg logs, and sosreport, etc. Not sure if you all can access. Please let me know the preferred way you guys want to work on this, so we all can stick to one thread. Thanks very much! Regards, Yichen (In reply to Luiz Capitulino from comment #2) > Hi Yichen, > > I think the spike is caused because of the warning (since we it call > printk(), etc). > So, our only issue may very well be the warning itself. > > Pei, would be nice to gather more info and try to reproduce in our lab. Luiz, I'm trying to reproduce this issue and update results soon. Best regards, Pei Hi, Pei/Luiz, Just want to give some updates. I can't really reproduce it on my testbed for the past 3 days, and things just went fine. I am requesting some information from customer on what they do in vcpu2 and hopefully that will help. Meanwhile, I am not sure if you guys have any clue by reading the dumps? Thanks very much! Regards, Yichen (In reply to Yichen Wang from comment #8) > Hi, Pei/Luiz, > > Just want to give some updates. I can't really reproduce it on my testbed > for the past 3 days, and things just went fine. I am requesting some > information from customer on what they do in vcpu2 and hopefully that will > help. Meanwhile, I am not sure if you guys have any clue by reading the > dumps? Hi, Yichen, I think we've found the issue behind, we probably need to backport 2 commits (5753743fa510, 81b016676e1c). (I'm copy & paste some discussion context below) On Sun, Jun 09, 2019 at 09:19:09PM +0800, Peter Xu wrote: > On Thu, Jun 06, 2019 at 01:31:20PM -0400, Luiz Capitulino wrote: > > Hi Peter, > > Hi, Luiz, > > Sorry for the late reply (Jun 7th was holiday here). > > > > > I just would like to let you know that I've assigned a high priority > > customer BZ to you: > > > > Bug 1717212 - KVM tracebacks causing significant latency to VM > > (https://bugzilla.redhat.com/show_bug.cgi?id=1717212) > > > > I've taken a quick look at this issue, and it seems to me that the > > almost 4ms spike is caused because of the WARNING that's being > > triggered. That is, if the warning is triggered in the cyclictest > > code path, then it is going to be this huge since we call printk() etc. > > > > So, I think a good first step is to understand why this warning > > is being triggered. I can see that its called from > > kvm_vcpu_trigger_posted_interrupt() with this comment: > > > > /* > > * Currently, we don't support urgent interrupt, > > * all interrupts are recognized as non-urgent > > * interrupt, so we cannot post interrupts when > > * 'SN' is set. > > > > I know nothing about posted interrupt support, but this raises > > a few questions for me: is it the guest that's using posted > > interrupt? > > IMHO yes, otherwise it can't go into that stack. > > To validate whether the delay was caused by the printk(), maybe we can > ask the customer to disable PI to see whether the delay will trigger > again? IIUC unsetting kvm_intel.enable_apicv module parameter should > do. > > > If yes, then would a RHEL7.6 guest do it (note that they > > are using CentOS, and this is not supported). Being the guest > > or the host kernel issuing the posted interrupt, why SN is set? > > Maybe the warning can just be dropped? > > Yes I think that's because we didn't backport this commit from > upstream to RHEL7 (and RHEL7-rt, for our case): > > 5753743fa510 ("KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interrupt", 2017-09-19) > > Though, one thing I don't know here is even without 5753743fa510 I > still cannot understand why that WARN_ON_ONCE could be triggered after > all (after we picked up dc91f2eb1a40 which seems to really fix a race > on SN bit). That warning dumped basically means the guest has had a > time when it has both these conditions met: > > (vcpu->mode == IN_GUEST_MODE) && (SN == 1) > > However in vcpu_enter_guest() we have this pattern: > > vcpu_enter_guest() > { > ... > vcpu->mode = IN_GUEST_MODE; [1] > ... > vcpu->run(); > ... > vcpu->mode = OUTSIDE_GUEST_MODE; [2] > } > > What I cannot understand is why SN can be set to 1 during [1] and [2] > (since that's the only place we can have vcpu->mode == IN_GUEST_MODE > AFAIU...). The problem is that IIUC the only place we'll set SN=1 is > in kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the > vcpu thread out of the physical CPU, and I don't see how that can > happen between [1] and [2]. > > Paolo/Marcelo, could any of you help me to understand what I've > missed? > > > > > Another point is that Jaison Raju from support is asking some > > questions in comment 5 about memballoon. While I think it's worth > > to check this configuration, I think this comment is kind of > > "shooting in all directions". I'd suggest we stay focused on > > debugging the warning issue first and follow up from there. > > And yes I totally agree we should start with this warning first. > > Thanks, > > -- > Peter Xu Peter, vmx_vcpu_pi_put() can leave the SN bit set (in the cmpxchg case of current RHEL). This one has not been backported: commit 81b016676e1c8f58027bd4d2b1d8a981776b36fe Author: Luwei Kang <luwei.kang> Date: Thu Jan 31 16:52:02 2019 +0800 KVM: x86: Sync the pending Posted-Interrupts Some Posted-Interrupts from passthrough devices may be lost or overwritten when the vCPU is in runnable state. The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but not running on physical CPU). If a posted interrupt coming at this time, the irq remmaping facility will set the bit of PIR (Posted Interrupt Requests) without ON (Outstanding Notification). So this interrupt can't be sync to APIC virtualization register and will not be handled by Guest because ON is zero. Signed-off-by: Luwei Kang <luwei.kang> [Eliminate the pi_clear_sn fast path. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini> (In reply to Peter Xu from comment #9) > (In reply to Yichen Wang from comment #8) > > Hi, Pei/Luiz, > > > > Just want to give some updates. I can't really reproduce it on my testbed > > for the past 3 days, and things just went fine. I am requesting some > > information from customer on what they do in vcpu2 and hopefully that will > > help. Meanwhile, I am not sure if you guys have any clue by reading the > > dumps? > > Hi, Yichen, > > I think we've found the issue behind, we probably need to backport 2 commits > (5753743fa510, 81b016676e1c). > > (I'm copy & paste some discussion context below) > > On Sun, Jun 09, 2019 at 09:19:09PM +0800, Peter Xu wrote: > > On Thu, Jun 06, 2019 at 01:31:20PM -0400, Luiz Capitulino wrote: > > > Hi Peter, > > > > Hi, Luiz, > > > > Sorry for the late reply (Jun 7th was holiday here). > > > > > > > > I just would like to let you know that I've assigned a high priority > > > customer BZ to you: > > > > > > Bug 1717212 - KVM tracebacks causing significant latency to VM > > > (https://bugzilla.redhat.com/show_bug.cgi?id=1717212) > > > > > > I've taken a quick look at this issue, and it seems to me that the > > > almost 4ms spike is caused because of the WARNING that's being > > > triggered. That is, if the warning is triggered in the cyclictest > > > code path, then it is going to be this huge since we call printk() etc. > > > > > > So, I think a good first step is to understand why this warning > > > is being triggered. I can see that its called from > > > kvm_vcpu_trigger_posted_interrupt() with this comment: > > > > > > /* > > > * Currently, we don't support urgent interrupt, > > > * all interrupts are recognized as non-urgent > > > * interrupt, so we cannot post interrupts when > > > * 'SN' is set. > > > > > > I know nothing about posted interrupt support, but this raises > > > a few questions for me: is it the guest that's using posted > > > interrupt? > > > > IMHO yes, otherwise it can't go into that stack. > > > > To validate whether the delay was caused by the printk(), maybe we can > > ask the customer to disable PI to see whether the delay will trigger > > again? IIUC unsetting kvm_intel.enable_apicv module parameter should > > do. > > > > > If yes, then would a RHEL7.6 guest do it (note that they > > > are using CentOS, and this is not supported). Being the guest > > > or the host kernel issuing the posted interrupt, why SN is set? > > > Maybe the warning can just be dropped? > > > > Yes I think that's because we didn't backport this commit from > > upstream to RHEL7 (and RHEL7-rt, for our case): > > > > 5753743fa510 ("KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interrupt", 2017-09-19) > > > > Though, one thing I don't know here is even without 5753743fa510 I > > still cannot understand why that WARN_ON_ONCE could be triggered after > > all (after we picked up dc91f2eb1a40 which seems to really fix a race > > on SN bit). That warning dumped basically means the guest has had a > > time when it has both these conditions met: > > > > (vcpu->mode == IN_GUEST_MODE) && (SN == 1) > > > > However in vcpu_enter_guest() we have this pattern: > > > > vcpu_enter_guest() > > { > > ... > > vcpu->mode = IN_GUEST_MODE; [1] > > ... > > vcpu->run(); > > ... > > vcpu->mode = OUTSIDE_GUEST_MODE; [2] > > } > > > > What I cannot understand is why SN can be set to 1 during [1] and [2] > > (since that's the only place we can have vcpu->mode == IN_GUEST_MODE > > AFAIU...). The problem is that IIUC the only place we'll set SN=1 is > > in kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the > > vcpu thread out of the physical CPU, and I don't see how that can > > happen between [1] and [2]. > > > > Paolo/Marcelo, could any of you help me to understand what I've > > missed? > > > > > > > > Another point is that Jaison Raju from support is asking some > > > questions in comment 5 about memballoon. While I think it's worth > > > to check this configuration, I think this comment is kind of > > > "shooting in all directions". I'd suggest we stay focused on > > > debugging the warning issue first and follow up from there. > > > > And yes I totally agree we should start with this warning first. > > > > Thanks, > > > > -- > > Peter Xu > > Peter, > > vmx_vcpu_pi_put() can leave the SN bit set (in the cmpxchg case of > current RHEL). > > This one has not been backported: > > commit 81b016676e1c8f58027bd4d2b1d8a981776b36fe > Author: Luwei Kang <luwei.kang> > Date: Thu Jan 31 16:52:02 2019 +0800 > > KVM: x86: Sync the pending Posted-Interrupts > Some Posted-Interrupts from passthrough devices may be lost or > overwritten when the vCPU is in runnable state. > > The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) > will > be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE > state > but not running on physical CPU). If a posted interrupt coming at > this > time, the irq remmaping facility will set the bit of PIR (Posted > Interrupt Requests) without ON (Outstanding Notification). > So this interrupt can't be sync to APIC virtualization register and > will not be handled by Guest because ON is zero. > > Signed-off-by: Luwei Kang <luwei.kang> > [Eliminate the pi_clear_sn fast path. - Paolo] > Signed-off-by: Paolo Bonzini <pbonzini> Hi, Pei, Thanks a lot for sharing the discussion. I don't really know much about the Posted Interrupt, but seems we've found the patches which may fix the issue. :) One thing caught my attention in the discussion is, the function "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu thread out of the physical CPU". I believe in our setup we are doing a cpu pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility that I can think of is for scheduling out the vcpu thread is to deliver the timer interrupt, am I right? Is there any possible use case which will trigger this stack on host? I am still waiting on the reply from customer to understand the activities that they are doing in vcpu2. Thanks very much! Regards, Yichen Forgot to mention one more thing that caught my attention. I know that customer does have DPDK-IO on a SRIOV VF (i40evf). As SRIOV is a passthrough, so I would say the patch 81b016676e1c8f58027bd4d2b1d8a981776b36fe will likely to fix the issue. Just my two cents... Thanks very much! Regards, Yichen Hi, Yichen, (In reply to Yichen Wang from comment #10) > One thing caught my attention in the discussion is, the function > "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu > thread out of the physical CPU". I believe in our setup we are doing a cpu > pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility > that I can think of is for scheduling out the vcpu thread is to deliver the > timer interrupt, am I right? Is there any possible use case which will > trigger this stack on host? That's only some technical discussion about the commit itself rather than any discussion about the bug we've encountered, so please feel free to ignore that paragraph. For the backtrace in the bug, the vcpu should be writting to one L2APIC MSR (I think it's an IPI) and it triggered a warning, which dragged things down. Peter (In reply to Peter Xu from comment #12) > Hi, Yichen, > > (In reply to Yichen Wang from comment #10) > > One thing caught my attention in the discussion is, the function > > "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu > > thread out of the physical CPU". I believe in our setup we are doing a cpu > > pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility > > that I can think of is for scheduling out the vcpu thread is to deliver the > > timer interrupt, am I right? Is there any possible use case which will > > trigger this stack on host? > > That's only some technical discussion about the commit itself rather than > any discussion about the bug we've encountered, so please feel free to > ignore that paragraph. > > For the backtrace in the bug, the vcpu should be writting to one L2APIC MSR > (I think it's an IPI) and it triggered a warning, which dragged things down. > > Peter Hi, Peter, Thanks a lot for your reply. When plain cyclictest was running, I see below messages coming to host dmesg pretty frequently: [67683.477201] kvm_get_msr_common: 53 callbacks suppressed [67683.477203] kvm [289365]: vcpu12 ignored rdmsr: 0x4e [67683.491748] kvm_set_msr_common: 20 callbacks suppressed [67683.491749] kvm [289365]: vcpu12 ignored wrmsr: 0x4e data 2 [67683.506904] kvm [289365]: vcpu12 ignored rdmsr: 0x4e [67683.535912] kvm [289365]: vcpu13 ignored rdmsr: 0x4e [67683.542900] kvm [289365]: vcpu13 ignored wrmsr: 0x4e data 2 [67683.550429] kvm [289365]: vcpu13 ignored rdmsr: 0x4e [67683.564607] kvm [289365]: vcpu14 ignored rdmsr: 0x4e [67683.571429] kvm [289365]: vcpu14 ignored wrmsr: 0x4e data 2 [67683.578785] kvm [289365]: vcpu14 ignored rdmsr: 0x4e [67683.607474] kvm [289365]: vcpu15 ignored rdmsr: 0x4e [67683.614213] kvm [289365]: vcpu15 ignored wrmsr: 0x4e data 2 [67683.621569] kvm [289365]: vcpu15 ignored rdmsr: 0x4e [67683.634478] kvm [289365]: vcpu16 ignored rdmsr: 0x4e [67683.641148] kvm [289365]: vcpu16 ignored wrmsr: 0x4e data 2 [67683.648429] kvm [289365]: vcpu16 ignored rdmsr: 0x4e [67683.677078] kvm [289365]: vcpu17 ignored wrmsr: 0x4e data 2 [67689.276957] kvm_get_msr_common: 13 callbacks suppressed [67689.276959] kvm [289365]: vcpu0 ignored rdmsr: 0x611 [67689.290618] kvm [289365]: vcpu0 ignored rdmsr: 0x639 [67689.297173] kvm [289365]: vcpu0 ignored rdmsr: 0x641 [67689.303705] kvm [289365]: vcpu0 ignored rdmsr: 0x619 [67689.326863] kvm [289365]: vcpu0 ignored rdmsr: 0x611 [67689.333358] kvm [289365]: vcpu0 ignored rdmsr: 0x639 [67689.339760] kvm [289365]: vcpu0 ignored rdmsr: 0x641 [67689.346084] kvm [289365]: vcpu0 ignored rdmsr: 0x619 [67689.362541] kvm [289365]: vcpu0 ignored rdmsr: 0x611 [67689.368786] kvm [289365]: vcpu0 ignored rdmsr: 0x639 We have below parameters on top of default ko parameters: options kvm ignore_msrs=Y options kvm kvmclock_periodic_sync=0 options kvm_intel nested=Y options kvm_intel ple_window=0 options kvm_intel ple_gap=0 However, I remembered we saw this long time ago, but it doesn't seem to create big trouble for us, so we kind of ignroed. Not sure if this implies anything. Also, just receive the information from customer. Quote: "It is another application worker thread in a tight loop which is mostly user space but with some system calls (like pipe write, sem post, sched_yield, mutex/spinlock). All IRQ’s in the system are pinned to non-isolated cores using irqaffinity in the kernel command line. So in theory no IRQ’s should land on this vcpu 2." Hope above helps. Thanks very much! Regards, Yichen (In reply to Yichen Wang from comment #13) > When plain cyclictest was running, I see below > messages coming to host dmesg pretty frequently: > [67683.477201] kvm_get_msr_common: 53 callbacks suppressed > [67683.477203] kvm [289365]: vcpu12 ignored rdmsr: 0x4e > [67683.491748] kvm_set_msr_common: 20 callbacks suppressed > [67683.491749] kvm [289365]: vcpu12 ignored wrmsr: 0x4e data 2 > [67683.506904] kvm [289365]: vcpu12 ignored rdmsr: 0x4e > [67683.535912] kvm [289365]: vcpu13 ignored rdmsr: 0x4e > [67683.542900] kvm [289365]: vcpu13 ignored wrmsr: 0x4e data 2 > [67683.550429] kvm [289365]: vcpu13 ignored rdmsr: 0x4e > [67683.564607] kvm [289365]: vcpu14 ignored rdmsr: 0x4e These errors reminded me about commit fab0aa3b776f which should also help for perf critical tasks, then we can turn report_ignored_msrs completely off. Maybe I can add that into the backport too? Paolo/Marcelo? (In reply to Yichen Wang from comment #8) > Just want to give some updates. I can't really reproduce it on my testbed > for the past 3 days, and things just went fine. I am requesting some > information from customer on what they do in vcpu2 and hopefully that will > help. Yichen & Pei, In parallel with the backporting - please let us know if any of you (or the customer) can reproduce the problem somewhere. Update: I didn't reproduce this issue after 2 days cyclictest yet. The cyclictest test is still ongoing. Versions(Same with Description): 3.10.0-957.12.1.rt56.927.el7.x86_64 qemu-img-rhev-2.12.0-18.el7_6.1.x86_64 Update: I can't either for 3 days cyclictest. So it must be customer's special workload that triggers the kdump. However when I talked to the customer, even they are not seeing that often, it is like every few days. When it happens, their VM also crash because missing the realtime-ness. Even they are having a hard time to reproduce, and all needed information (sosreport, logs, workload explaination) they've shared to us already, and we don't know what we can ask for more. Maybe when we backported the patches, we can give a trail version for customer to try out again to see if they can reproduce it. Thanks very much! Regards, Yichen Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2043 |