Bug 1717212

Summary:	KVM tracebacks causing significant latency to VM
Product:	Red Hat Enterprise Linux 7	Reporter:	Yichen Wang <yicwang>
Component:	kernel-rt	Assignee:	Peter Xu <peterx>
kernel-rt sub component:	KVM	QA Contact:	Pei Zhang <pezhang>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	bhu, chayang, chhudson, dhoward, jraju, juzhang, kabbott, lcapitulino, lgoncalv, mtosatti, pbonzini, peterx, pezhang, qzhang, virt-maint
Version:	7.7	Keywords:	ZStream
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-rt-3.10.0-1057.rt56.1017.el7	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1719525 1728176 (view as bug list)		Environment:
Last Closed:	2019-08-06 12:36:32 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1719525
Bug Blocks:	1655694, 1728176

Description Yichen Wang 2019-06-04 22:31:52 UTC

Description of problem:
VM runs cyclictest, and randomly hitting spikes up to 3800us, which is not acceptable. When the 

Version-Release number of selected component (if applicable):
Kernel: 3.10.0-957.12.1.rt56.927.el7.x86_64
KVM: qemu-kvm-rhev-2.12.0-18.el7_6.1.x86_64

How reproducible:
Randomly, every few days

Steps to Reproduce:
1. Launch a VM off OSP13 OpenStack, flavor:
openstack flavor set m1.medium --property "hw:cpu_policy"="dedicated"
openstack flavor set m1.medium --property "hw:mem_page_size"="2048"
openstack flavor set m1.medium --property "hw:cpu_realtime"="yes"
openstack flavor set m1.medium --property "hw:cpu_realtime_mask"="^0"
openstack flavor set m1.medium --property "hw:emulator_threads_policy"="share"
openstack flavor set m1.medium --property "hw:cpu_thread_policy"="require"

2. Run cyclictest -S -m -n -p99 -d0 -A ffff

3. Kdumps from host OS, and seeing latency spikes.

Expected results:
No Kdumps, no latency spike

Additional info:
tuned is applied with profile realtime-nfv-host.
guest is CentOS 7.6 RT

Dumps shown on host host:
[Sat Jun  1 18:00:56 2019] ------------[ cut here ]------------
[Sat Jun  1 18:00:56 2019] WARNING: CPU: 27 PID: 40525 at arch/x86/kvm/vmx.c:5085 vmx_deliver_posted_interrupt+0x156/0x170 [kvm_intel]
[Sat Jun  1 18:00:56 2019] Modules linked in: vhost_net vhost macvtap macvlan tun ip6table_raw xt_comment xt_physdev iptable_raw sr_mod cdrom joydev cdc_ether usbnet mii uas usb_storage ebtable_filter ebtables devlink veth vfio_pci vfio_iommu_type1 vfio igb_uio(OE) uio ip6table_filter ip6_tables ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc team_mode_loadbalance team overlay(T) intel_max10(OE) at24 i40evf spi_altera_mod(OE) spi_bitbang_mod(OE) i2c_altera(OE) regmap_mmio_mod(OE) intel_fpga_fme(OE) intel_fpga_afu(OE) fpga_mgr_mod(OE) iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr vfat fat
[Sat Jun  1 18:00:56 2019]  ext4
[Sat Jun  1 18:00:56 2019]  mbcache
[Sat Jun  1 18:00:56 2019]  jbd2
[Sat Jun  1 18:00:56 2019]  sg
[Sat Jun  1 18:00:56 2019]  intel_fpga_pci(OE)
[Sat Jun  1 18:00:56 2019]  i2c_i801
[Sat Jun  1 18:00:56 2019]  lpc_ich
[Sat Jun  1 18:00:56 2019]  mei_me
[Sat Jun  1 18:00:56 2019]  mei
[Sat Jun  1 18:00:56 2019]  wmi
[Sat Jun  1 18:00:56 2019]  ipmi_si
[Sat Jun  1 18:00:56 2019]  ipmi_devintf
[Sat Jun  1 18:00:56 2019]  ipmi_msghandler
[Sat Jun  1 18:00:56 2019]  acpi_cpufreq
[Sat Jun  1 18:00:56 2019]  acpi_pad
[Sat Jun  1 18:00:56 2019]  tpm_crb
[Sat Jun  1 18:00:56 2019]  acpi_power_meter
[Sat Jun  1 18:00:56 2019]  openvswitch
[Sat Jun  1 18:00:56 2019]  nf_conntrack_ipv6
[Sat Jun  1 18:00:56 2019]  nf_nat_ipv6
[Sat Jun  1 18:00:56 2019]  nf_conntrack_ipv4
[Sat Jun  1 18:00:56 2019]  nf_defrag_ipv4
[Sat Jun  1 18:00:56 2019]  nf_nat_ipv4
[Sat Jun  1 18:00:56 2019]  nf_defrag_ipv6
[Sat Jun  1 18:00:56 2019]  nf_nat
[Sat Jun  1 18:00:56 2019]  nf_conntrack
[Sat Jun  1 18:00:56 2019]  ip_tables
[Sat Jun  1 18:00:56 2019]  xfs
[Sat Jun  1 18:00:56 2019]  libcrc32c
[Sat Jun  1 18:00:56 2019]  sd_mod
[Sat Jun  1 18:00:56 2019]  crc_t10dif
[Sat Jun  1 18:00:56 2019]  crct10dif_generic
[Sat Jun  1 18:00:56 2019]  crct10dif_pclmul
[Sat Jun  1 18:00:56 2019]  crct10dif_common
[Sat Jun  1 18:00:56 2019]  crc32c_intel
[Sat Jun  1 18:00:56 2019]  i40e(OE)
[Sat Jun  1 18:00:56 2019]  megaraid_sas
[Sat Jun  1 18:00:56 2019]  ptp
[Sat Jun  1 18:00:56 2019]  i2c_algo_bit
[Sat Jun  1 18:00:56 2019]  pps_core
[Sat Jun  1 18:00:56 2019]  drm_kms_helper
[Sat Jun  1 18:00:56 2019]  syscopyarea
[Sat Jun  1 18:00:56 2019]  sysfillrect
[Sat Jun  1 18:00:56 2019]  sysimgblt
[Sat Jun  1 18:00:56 2019]  fb_sys_fops
[Sat Jun  1 18:00:56 2019]  ttm
[Sat Jun  1 18:00:56 2019]  drm 
[Sat Jun  1 18:00:56 2019]  drm_panel_orientation_quirks
[Sat Jun  1 18:00:56 2019]  ahci
[Sat Jun  1 18:00:56 2019]  libahci
[Sat Jun  1 18:00:56 2019]  libata
[Sat Jun  1 18:00:56 2019]  nfit
[Sat Jun  1 18:00:56 2019]  libnvdimm
[Sat Jun  1 18:00:56 2019]  dm_mirror
[Sat Jun  1 18:00:56 2019]  dm_region_hash
[Sat Jun  1 18:00:56 2019]  dm_log
[Sat Jun  1 18:00:56 2019]  dm_mod
 
[Sat Jun  1 18:00:56 2019] CPU: 27 PID: 40525 Comm: CPU 2/KVM Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-957.12.1.rt56.927.el7.x86_64 #1
[Sat Jun  1 18:00:56 2019] Hardware name: Quanta Cloud Technology Inc. QuantaGrid D52BE-2U/S5BE-MB 3UPI (LBG-1G), BIOS 3A11.BT15 02/21/2019
[Sat Jun  1 18:00:56 2019] Call Trace:
[Sat Jun  1 18:00:56 2019]  [<ffffffff91d547cb>] dump_stack+0x19/0x1b
[Sat Jun  1 18:00:56 2019]  [<ffffffff9168751b>] __warn+0xfb/0x120
[Sat Jun  1 18:00:56 2019]  [<ffffffff9168765d>] warn_slowpath_null+0x1d/0x20
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f44436>] vmx_deliver_posted_interrupt+0x156/0x170 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06cdbf4>] __apic_accept_irq+0x2a4/0x3e0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06cde2a>] kvm_apic_set_irq+0x2a/0x30 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06cdff3>] kvm_irq_delivery_to_apic_fast+0x163/0x400 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06d2e15>] kvm_irq_delivery_to_apic+0x65/0x2f0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06a823d>] ? kvm_arch_vcpu_load+0x5d/0x2a0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc068f439>] ? kvm_sched_in+0x39/0x40 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffff916c1c0e>] ? finish_task_switch+0xfe/0x160
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06cd2af>] kvm_lapic_reg_write+0x11f/0x6a0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06cfc2d>] kvm_x2apic_msr_write+0x5d/0xa0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06a5e2d>] kvm_set_msr_common+0x29d/0xb60 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f4947e>] ? vmx_vcpu_run+0x35e/0xa90 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f49472>] ? vmx_vcpu_run+0x352/0xa90 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f4947e>] ? vmx_vcpu_run+0x35e/0xa90 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f4d73a>] vmx_set_msr+0x16a/0x5f0 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0698617>] kvm_set_msr+0x47/0x80 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f48628>] handle_wrmsr+0x58/0x1b0 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f4aedc>] vmx_handle_exit+0x16c/0xcf0 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f49472>] ? vmx_vcpu_run+0x352/0xa90 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0f4947e>] ? vmx_vcpu_run+0x35e/0xa90 [kvm_intel]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06a73e9>] vcpu_enter_guest+0xa89/0x13f0 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc06aed18>] kvm_arch_vcpu_ioctl_run+0x368/0x490 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffffc0693491>] kvm_vcpu_ioctl+0x2d1/0x720 [kvm]
[Sat Jun  1 18:00:56 2019]  [<ffffffff91842d70>] do_vfs_ioctl+0x3a0/0x5a0
[Sat Jun  1 18:00:56 2019]  [<ffffffff9174ae80>] ? __call_rcu+0xb0/0x2a0
[Sat Jun  1 18:00:56 2019]  [<ffffffff917493b0>] ? rcu_eqs_exit+0x50/0xa0
[Sat Jun  1 18:00:56 2019]  [<ffffffff91843011>] SyS_ioctl+0xa1/0xc0
[Sat Jun  1 18:00:56 2019]  [<ffffffff91d662dd>] tracesys+0xa3/0xc9
[Sat Jun  1 18:00:56 2019] ---[ end trace 0000000000000002 ]—

Comment 2 Luiz Capitulino 2019-06-05 12:57:36 UTC

Hi Yichen,

I think the spike is caused because of the warning (since we it call printk(), etc).
So, our only issue may very well be the warning itself.

Pei, would be nice to gather more info and try to reproduce in our lab.

Comment 3 Yichen Wang 2019-06-05 18:24:43 UTC

There are more information asked by the engineer assigned to that case: https://access.redhat.com/support/cases/#/case/02397696. dmesg logs, and sosreport, etc. Not sure if you all can access. Please let me know the preferred way you guys want to work on this, so we all can stick to one thread.

Thanks very much!

Regards,
Yichen

Comment 6 Pei Zhang 2019-06-10 10:19:07 UTC

(In reply to Luiz Capitulino from comment #2)
> Hi Yichen,
> 
> I think the spike is caused because of the warning (since we it call
> printk(), etc).
> So, our only issue may very well be the warning itself.
> 
> Pei, would be nice to gather more info and try to reproduce in our lab.

Luiz, I'm trying to reproduce this issue and update results soon.


Best regards,
Pei

Comment 8 Yichen Wang 2019-06-10 18:15:56 UTC

Hi, Pei/Luiz,

Just want to give some updates. I can't really reproduce it on my testbed for the past 3 days, and things just went fine. I am requesting some information from customer on what they do in vcpu2 and hopefully that will help. Meanwhile, I am not sure if you guys have any clue by reading the dumps?

Thanks very much!

Regards,
Yichen

Comment 9 Peter Xu 2019-06-11 01:44:30 UTC

(In reply to Yichen Wang from comment #8)
> Hi, Pei/Luiz,
> 
> Just want to give some updates. I can't really reproduce it on my testbed
> for the past 3 days, and things just went fine. I am requesting some
> information from customer on what they do in vcpu2 and hopefully that will
> help. Meanwhile, I am not sure if you guys have any clue by reading the
> dumps?

Hi, Yichen,

I think we've found the issue behind, we probably need to backport 2 commits (5753743fa510, 81b016676e1c).

(I'm copy & paste some discussion context below)

On Sun, Jun 09, 2019 at 09:19:09PM +0800, Peter Xu wrote:
> On Thu, Jun 06, 2019 at 01:31:20PM -0400, Luiz Capitulino wrote:
> > Hi Peter,
>
> Hi, Luiz,
>
> Sorry for the late reply (Jun 7th was holiday here).
>
> >
> > I just would like to let you know that I've assigned a high priority
> > customer BZ to you:
> >
> > Bug 1717212 - KVM tracebacks causing significant latency to VM
> > (https://bugzilla.redhat.com/show_bug.cgi?id=1717212)
> >
> > I've taken a quick look at this issue, and it seems to me that the
> > almost 4ms spike is caused because of the WARNING that's being
> > triggered. That is, if the warning is triggered in the cyclictest
> > code path, then it is going to be this huge since we call printk() etc.
> >
> > So, I think a good first step is to understand why this warning
> > is being triggered. I can see that its called from
> > kvm_vcpu_trigger_posted_interrupt() with this comment:
> >
> >      /*
> >       * Currently, we don't support urgent interrupt,
> >       * all interrupts are recognized as non-urgent
> >       * interrupt, so we cannot post interrupts when
> >       * 'SN' is set.
> >
> > I know nothing about posted interrupt support, but this raises
> > a few questions for me: is it the guest that's using posted
> > interrupt?
>
> IMHO yes, otherwise it can't go into that stack.
>
> To validate whether the delay was caused by the printk(), maybe we can
> ask the customer to disable PI to see whether the delay will trigger
> again?  IIUC unsetting kvm_intel.enable_apicv module parameter should
> do.
>
> > If yes, then would a RHEL7.6 guest do it (note that they
> > are using CentOS, and this is not supported). Being the guest
> > or the host kernel issuing the posted interrupt, why SN is set?
> > Maybe the warning can just be dropped?
>
> Yes I think that's because we didn't backport this commit from
> upstream to RHEL7 (and RHEL7-rt, for our case):
>
> 5753743fa510 ("KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interrupt", 2017-09-19)
>
> Though, one thing I don't know here is even without 5753743fa510 I
> still cannot understand why that WARN_ON_ONCE could be triggered after
> all (after we picked up dc91f2eb1a40 which seems to really fix a race
> on SN bit).  That warning dumped basically means the guest has had a
> time when it has both these conditions met:
>
>   (vcpu->mode == IN_GUEST_MODE) && (SN == 1)
>
> However in vcpu_enter_guest() we have this pattern:
>
>   vcpu_enter_guest()
>   {
>     ...
>     vcpu->mode = IN_GUEST_MODE;                  [1]
>     ...
>     vcpu->run();
>     ...
>     vcpu->mode = OUTSIDE_GUEST_MODE;             [2]
>   }
>
> What I cannot understand is why SN can be set to 1 during [1] and [2]
> (since that's the only place we can have vcpu->mode == IN_GUEST_MODE
> AFAIU...).  The problem is that IIUC the only place we'll set SN=1 is
> in kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the
> vcpu thread out of the physical CPU, and I don't see how that can
> happen between [1] and [2].
>
> Paolo/Marcelo, could any of you help me to understand what I've
> missed?
>
> >
> > Another point is that Jaison Raju from support is asking some
> > questions in comment 5 about memballoon. While I think it's worth
> > to check this configuration, I think this comment is kind of
> > "shooting in all directions". I'd suggest we stay focused on
> > debugging the warning issue first and follow up from there.
>
> And yes I totally agree we should start with this warning first.
>
> Thanks,
>
> --
> Peter Xu

Peter,

vmx_vcpu_pi_put() can leave the SN bit set (in the cmpxchg case of
current RHEL).

This one has not been backported:

commit 81b016676e1c8f58027bd4d2b1d8a981776b36fe
Author: Luwei Kang <luwei.kang>
Date:   Thu Jan 31 16:52:02 2019 +0800

    KVM: x86: Sync the pending Posted-Interrupts
    Some Posted-Interrupts from passthrough devices may be lost or
    overwritten when the vCPU is in runnable state.

    The SN (Suppress Notification) of PID (Posted Interrupt Descriptor)
will
    be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE
state
    but not running on physical CPU). If a posted interrupt coming at
this
    time, the irq remmaping facility will set the bit of PIR (Posted
    Interrupt Requests) without ON (Outstanding Notification).
    So this interrupt can't be sync to APIC virtualization register and
    will not be handled by Guest because ON is zero.

    Signed-off-by: Luwei Kang <luwei.kang>
    [Eliminate the pi_clear_sn fast path. - Paolo]
    Signed-off-by: Paolo Bonzini <pbonzini>

Comment 10 Yichen Wang 2019-06-11 07:14:24 UTC

(In reply to Peter Xu from comment #9)
> (In reply to Yichen Wang from comment #8)
> > Hi, Pei/Luiz,
> > 
> > Just want to give some updates. I can't really reproduce it on my testbed
> > for the past 3 days, and things just went fine. I am requesting some
> > information from customer on what they do in vcpu2 and hopefully that will
> > help. Meanwhile, I am not sure if you guys have any clue by reading the
> > dumps?
> 
> Hi, Yichen,
> 
> I think we've found the issue behind, we probably need to backport 2 commits
> (5753743fa510, 81b016676e1c).
> 
> (I'm copy & paste some discussion context below)
> 
> On Sun, Jun 09, 2019 at 09:19:09PM +0800, Peter Xu wrote:
> > On Thu, Jun 06, 2019 at 01:31:20PM -0400, Luiz Capitulino wrote:
> > > Hi Peter,
> >
> > Hi, Luiz,
> >
> > Sorry for the late reply (Jun 7th was holiday here).
> >
> > >
> > > I just would like to let you know that I've assigned a high priority
> > > customer BZ to you:
> > >
> > > Bug 1717212 - KVM tracebacks causing significant latency to VM
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=1717212)
> > >
> > > I've taken a quick look at this issue, and it seems to me that the
> > > almost 4ms spike is caused because of the WARNING that's being
> > > triggered. That is, if the warning is triggered in the cyclictest
> > > code path, then it is going to be this huge since we call printk() etc.
> > >
> > > So, I think a good first step is to understand why this warning
> > > is being triggered. I can see that its called from
> > > kvm_vcpu_trigger_posted_interrupt() with this comment:
> > >
> > >      /*
> > >       * Currently, we don't support urgent interrupt,
> > >       * all interrupts are recognized as non-urgent
> > >       * interrupt, so we cannot post interrupts when
> > >       * 'SN' is set.
> > >
> > > I know nothing about posted interrupt support, but this raises
> > > a few questions for me: is it the guest that's using posted
> > > interrupt?
> >
> > IMHO yes, otherwise it can't go into that stack.
> >
> > To validate whether the delay was caused by the printk(), maybe we can
> > ask the customer to disable PI to see whether the delay will trigger
> > again?  IIUC unsetting kvm_intel.enable_apicv module parameter should
> > do.
> >
> > > If yes, then would a RHEL7.6 guest do it (note that they
> > > are using CentOS, and this is not supported). Being the guest
> > > or the host kernel issuing the posted interrupt, why SN is set?
> > > Maybe the warning can just be dropped?
> >
> > Yes I think that's because we didn't backport this commit from
> > upstream to RHEL7 (and RHEL7-rt, for our case):
> >
> > 5753743fa510 ("KVM: VMX: remove WARN_ON_ONCE in kvm_vcpu_trigger_posted_interrupt", 2017-09-19)
> >
> > Though, one thing I don't know here is even without 5753743fa510 I
> > still cannot understand why that WARN_ON_ONCE could be triggered after
> > all (after we picked up dc91f2eb1a40 which seems to really fix a race
> > on SN bit).  That warning dumped basically means the guest has had a
> > time when it has both these conditions met:
> >
> >   (vcpu->mode == IN_GUEST_MODE) && (SN == 1)
> >
> > However in vcpu_enter_guest() we have this pattern:
> >
> >   vcpu_enter_guest()
> >   {
> >     ...
> >     vcpu->mode = IN_GUEST_MODE;                  [1]
> >     ...
> >     vcpu->run();
> >     ...
> >     vcpu->mode = OUTSIDE_GUEST_MODE;             [2]
> >   }
> >
> > What I cannot understand is why SN can be set to 1 during [1] and [2]
> > (since that's the only place we can have vcpu->mode == IN_GUEST_MODE
> > AFAIU...).  The problem is that IIUC the only place we'll set SN=1 is
> > in kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the
> > vcpu thread out of the physical CPU, and I don't see how that can
> > happen between [1] and [2].
> >
> > Paolo/Marcelo, could any of you help me to understand what I've
> > missed?
> >
> > >
> > > Another point is that Jaison Raju from support is asking some
> > > questions in comment 5 about memballoon. While I think it's worth
> > > to check this configuration, I think this comment is kind of
> > > "shooting in all directions". I'd suggest we stay focused on
> > > debugging the warning issue first and follow up from there.
> >
> > And yes I totally agree we should start with this warning first.
> >
> > Thanks,
> >
> > --
> > Peter Xu
> 
> Peter,
> 
> vmx_vcpu_pi_put() can leave the SN bit set (in the cmpxchg case of
> current RHEL).
> 
> This one has not been backported:
> 
> commit 81b016676e1c8f58027bd4d2b1d8a981776b36fe
> Author: Luwei Kang <luwei.kang>
> Date:   Thu Jan 31 16:52:02 2019 +0800
> 
>     KVM: x86: Sync the pending Posted-Interrupts
>     Some Posted-Interrupts from passthrough devices may be lost or
>     overwritten when the vCPU is in runnable state.
> 
>     The SN (Suppress Notification) of PID (Posted Interrupt Descriptor)
> will
>     be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE
> state
>     but not running on physical CPU). If a posted interrupt coming at
> this
>     time, the irq remmaping facility will set the bit of PIR (Posted
>     Interrupt Requests) without ON (Outstanding Notification).
>     So this interrupt can't be sync to APIC virtualization register and
>     will not be handled by Guest because ON is zero.
> 
>     Signed-off-by: Luwei Kang <luwei.kang>
>     [Eliminate the pi_clear_sn fast path. - Paolo]
>     Signed-off-by: Paolo Bonzini <pbonzini>

Hi, Pei,

Thanks a lot for sharing the discussion. I don't really know much about the Posted Interrupt, but seems we've found the patches which may fix the issue. :)

One thing caught my attention in the discussion is, the function "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu thread out of the physical CPU". I believe in our setup we are doing a cpu pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility that I can think of is for scheduling out the vcpu thread is to deliver the timer interrupt, am I right? Is there any possible use case which will trigger this stack on host?

I am still waiting on the reply from customer to understand the activities that they are doing in vcpu2.

Thanks very much!

Regards,
Yichen

Comment 11 Yichen Wang 2019-06-11 07:20:32 UTC

Forgot to mention one more thing that caught my attention. I know that customer does have DPDK-IO on a SRIOV VF (i40evf). As SRIOV is a passthrough, so I would say the patch 81b016676e1c8f58027bd4d2b1d8a981776b36fe will likely to fix the issue. Just my two cents...

Thanks very much!

Regards,
Yichen

Comment 12 Peter Xu 2019-06-11 07:31:27 UTC

Hi, Yichen,

(In reply to Yichen Wang from comment #10)
> One thing caught my attention in the discussion is, the function
> "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu
> thread out of the physical CPU". I believe in our setup we are doing a cpu
> pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility
> that I can think of is for scheduling out the vcpu thread is to deliver the
> timer interrupt, am I right? Is there any possible use case which will
> trigger this stack on host?

That's only some technical discussion about the commit itself rather than any discussion about the bug we've encountered, so please feel free to ignore that paragraph.

For the backtrace in the bug, the vcpu should be writting to one L2APIC MSR (I think it's an IPI) and it triggered a warning, which dragged things down.

Peter

Comment 13 Yichen Wang 2019-06-11 08:21:47 UTC

(In reply to Peter Xu from comment #12)
> Hi, Yichen,
> 
> (In reply to Yichen Wang from comment #10)
> > One thing caught my attention in the discussion is, the function
> > "kvm_sched_out() -> vmx_vcpu_pi_put() where we want to schedule the vcpu
> > thread out of the physical CPU". I believe in our setup we are doing a cpu
> > pinning, and guest VM has a "idle=poll" in GRUB. So the only possibility
> > that I can think of is for scheduling out the vcpu thread is to deliver the
> > timer interrupt, am I right? Is there any possible use case which will
> > trigger this stack on host?
> 
> That's only some technical discussion about the commit itself rather than
> any discussion about the bug we've encountered, so please feel free to
> ignore that paragraph.
> 
> For the backtrace in the bug, the vcpu should be writting to one L2APIC MSR
> (I think it's an IPI) and it triggered a warning, which dragged things down.
> 
> Peter

Hi, Peter,

Thanks a lot for your reply. When plain cyclictest was running, I see below messages coming to host dmesg pretty frequently:
[67683.477201] kvm_get_msr_common: 53 callbacks suppressed
[67683.477203] kvm [289365]: vcpu12 ignored rdmsr: 0x4e
[67683.491748] kvm_set_msr_common: 20 callbacks suppressed
[67683.491749] kvm [289365]: vcpu12 ignored wrmsr: 0x4e data 2
[67683.506904] kvm [289365]: vcpu12 ignored rdmsr: 0x4e
[67683.535912] kvm [289365]: vcpu13 ignored rdmsr: 0x4e
[67683.542900] kvm [289365]: vcpu13 ignored wrmsr: 0x4e data 2
[67683.550429] kvm [289365]: vcpu13 ignored rdmsr: 0x4e
[67683.564607] kvm [289365]: vcpu14 ignored rdmsr: 0x4e
[67683.571429] kvm [289365]: vcpu14 ignored wrmsr: 0x4e data 2
[67683.578785] kvm [289365]: vcpu14 ignored rdmsr: 0x4e
[67683.607474] kvm [289365]: vcpu15 ignored rdmsr: 0x4e
[67683.614213] kvm [289365]: vcpu15 ignored wrmsr: 0x4e data 2
[67683.621569] kvm [289365]: vcpu15 ignored rdmsr: 0x4e
[67683.634478] kvm [289365]: vcpu16 ignored rdmsr: 0x4e
[67683.641148] kvm [289365]: vcpu16 ignored wrmsr: 0x4e data 2
[67683.648429] kvm [289365]: vcpu16 ignored rdmsr: 0x4e
[67683.677078] kvm [289365]: vcpu17 ignored wrmsr: 0x4e data 2
[67689.276957] kvm_get_msr_common: 13 callbacks suppressed
[67689.276959] kvm [289365]: vcpu0 ignored rdmsr: 0x611
[67689.290618] kvm [289365]: vcpu0 ignored rdmsr: 0x639
[67689.297173] kvm [289365]: vcpu0 ignored rdmsr: 0x641
[67689.303705] kvm [289365]: vcpu0 ignored rdmsr: 0x619
[67689.326863] kvm [289365]: vcpu0 ignored rdmsr: 0x611
[67689.333358] kvm [289365]: vcpu0 ignored rdmsr: 0x639
[67689.339760] kvm [289365]: vcpu0 ignored rdmsr: 0x641
[67689.346084] kvm [289365]: vcpu0 ignored rdmsr: 0x619
[67689.362541] kvm [289365]: vcpu0 ignored rdmsr: 0x611
[67689.368786] kvm [289365]: vcpu0 ignored rdmsr: 0x639

We have below parameters on top of default ko parameters:
options kvm ignore_msrs=Y
options kvm kvmclock_periodic_sync=0
options kvm_intel nested=Y
options kvm_intel ple_window=0
options kvm_intel ple_gap=0

However, I remembered we saw this long time ago, but it doesn't seem to create big trouble for us, so we kind of ignroed. Not sure if this implies anything.

Also, just receive the information from customer. Quote:
"It is another application worker thread in a tight loop which is mostly user space but with some system calls (like pipe write, sem post, sched_yield, mutex/spinlock). All IRQ’s in the system are pinned to non-isolated cores using irqaffinity in the kernel command line. So in theory no IRQ’s should land on this vcpu 2."

Hope above helps.

Thanks very much!

Regards,
Yichen

Comment 14 Peter Xu 2019-06-11 12:09:34 UTC

(In reply to Yichen Wang from comment #13)
> When plain cyclictest was running, I see below
> messages coming to host dmesg pretty frequently:
> [67683.477201] kvm_get_msr_common: 53 callbacks suppressed
> [67683.477203] kvm [289365]: vcpu12 ignored rdmsr: 0x4e
> [67683.491748] kvm_set_msr_common: 20 callbacks suppressed
> [67683.491749] kvm [289365]: vcpu12 ignored wrmsr: 0x4e data 2
> [67683.506904] kvm [289365]: vcpu12 ignored rdmsr: 0x4e
> [67683.535912] kvm [289365]: vcpu13 ignored rdmsr: 0x4e
> [67683.542900] kvm [289365]: vcpu13 ignored wrmsr: 0x4e data 2
> [67683.550429] kvm [289365]: vcpu13 ignored rdmsr: 0x4e
> [67683.564607] kvm [289365]: vcpu14 ignored rdmsr: 0x4e

These errors reminded me about commit fab0aa3b776f which should also help for perf critical tasks, then we can turn report_ignored_msrs completely off.  Maybe I can add that into the backport too?  Paolo/Marcelo?

Comment 15 Peter Xu 2019-06-13 01:39:57 UTC

(In reply to Yichen Wang from comment #8)
> Just want to give some updates. I can't really reproduce it on my testbed
> for the past 3 days, and things just went fine. I am requesting some
> information from customer on what they do in vcpu2 and hopefully that will
> help.

Yichen & Pei,

In parallel with the backporting - please let us know if any of you (or the customer) can reproduce the problem somewhere.

Comment 16 Pei Zhang 2019-06-13 10:56:11 UTC

Update:

I didn't reproduce this issue after 2 days cyclictest yet. The cyclictest test is still ongoing. 

Versions(Same with Description):
3.10.0-957.12.1.rt56.927.el7.x86_64
qemu-img-rhev-2.12.0-18.el7_6.1.x86_64

Comment 17 Yichen Wang 2019-06-13 17:48:23 UTC

Update:

I can't either for 3 days cyclictest. So it must be customer's special workload that triggers the kdump. However when I talked to the customer, even they are not seeing that often, it is like every few days. When it happens, their VM also crash because missing the realtime-ness. Even they are having a hard time to reproduce, and all needed information (sosreport, logs, workload explaination) they've shared to us already, and we don't know what we can ask for more.

Maybe when we backported the patches, we can give a trail version for customer to try out again to see if they can reproduce it.

Thanks very much!

Regards,
Yichen

Comment 66 errata-xmlrpc 2019-08-06 12:36:32 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2043