This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux major release. This request is not yet committed for
I've been looking at this problem and made some progress, although it's not all encouraging news.
I think I've described the basic problem enough, so I'll skip it here and just launch into specifics. In the KVM code, there are a number of places where we make incorrect assumptions regarding the timer and which vcpu to deliver it to:
1) In the i8254.c code when the hrtimer representing the PIT expires. In
this case, when we get the callback, we kick only the BSP.
2) In the i8254.c code when a vcpu is migrated from one processor to another.
In this case we only migrate the PIT timer if the vcpu to be migrated is the
3) In the lapic code when deciding whether to accept a PIC interrupt, we only accept interrupts on the BSP.
4) In the irq_comm.c code when calling kvm_irq_delivery_to_apic(). The problem
here is that we don't take into account the fact that an LAPIC might be disabled when trying to deliver an interrupt in DM_LOWEST mode. Further, on a kexec, the processor that we are kexec'ing *to* gets its APIC ID rewritten to the BSP APIC ID. The end result is that we currently still match against the BSP, even though vcpu 1 (where the kexec is happening) would match if we let it.
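To make item 4) concrete, here is a toy model in Python of lowest-priority destination selection. It is only a sketch, not the KVM code: the `Vcpu` fields, the `pick_lowest_priority` helper, and the tie-break on vcpu index are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vcpu:
    vcpu_id: int
    apic_id: int
    lapic_enabled: bool
    priority: int  # toy stand-in for arbitration priority; lower wins


def pick_lowest_priority(vcpus: List[Vcpu], dest_apic_id: int,
                         skip_disabled: bool = True) -> Optional[Vcpu]:
    """Return the vcpu that would receive a lowest-priority interrupt.

    With skip_disabled=False the model ignores the LAPIC enable state,
    which mimics the problem described in item 4).
    """
    candidates = [
        v for v in vcpus
        if v.apic_id == dest_apic_id and (v.lapic_enabled or not skip_disabled)
    ]
    if not candidates:
        return None
    # Ties are broken by vcpu index (an assumption of this model).
    return min(candidates, key=lambda v: (v.priority, v.vcpu_id))


# kexec-like scenario: the BSP's LAPIC is disabled and vcpu 1 has had
# its APIC ID rewritten to the BSP's APIC ID (0).
vcpus = [
    Vcpu(vcpu_id=0, apic_id=0, lapic_enabled=False, priority=0),
    Vcpu(vcpu_id=1, apic_id=0, lapic_enabled=True, priority=0),
]
print(pick_lowest_priority(vcpus, 0, skip_disabled=False).vcpu_id)
print(pick_lowest_priority(vcpus, 0).vcpu_id)
```

In this model, ignoring the enable state selects vcpu 0 (the BSP), while filtering out disabled LAPICs selects vcpu 1.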
I have a patch currently that can take care of 1), 3), and 4), and works in my
testing (it needs to be cleaned up a bit to be less inefficient, but it should work). However, problem 2) is pretty sticky. The reason we currently migrate the PIT timer around with the BSP is pretty well explained in commit 2f5997140f22f68f6390c49941150d3fa8a95cb7.
With my new patch, though, we no longer guarantee that we are going to inject onto CPU 0. I think we can do something where, when the hrtimer expires, we figure out which processor will get the timer interrupt and send an IPI to that processor to cause the VMEXIT. Unfortunately, that is racy, because the expiration of the hrtimer is decoupled from the setting of the IRQ for the interrupt. The hrtimer could expire, we could choose vcpu 2 (say) and IPI it to cause a VMEXIT, but by the time it goes to VMENTER the guest has changed something in the (IO)APIC(s) so that the set_irq logic chooses vcpu 3 to do the injection. This would result in the delayed injection that 2f5997 is trying to avoid.
Taking a step back, it seems to me that something along the lines of my previous patchset (where we do set_irq directly from the hrtimer callback) is the right way to go. We would still need to IPI the appropriate physical cpu to cause a VMEXIT on the vcpu we care about, but we would avoid the race I describe in the previous paragraph. Unfortunately, that patchset is also much riskier.
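The difference between the two orderings can be sketched with a small deterministic model in Python. This is an illustration under stated assumptions, not the KVM implementation: `ToyGuest`, `decoupled_expiry`, and `coupled_expiry` are hypothetical names, and the guest's (IO)APIC reprogramming is reduced to a single `route` field changing at a fixed point.

```python
class ToyGuest:
    """Minimal state: where the timer IRQ routes, who was kicked, who got it."""

    def __init__(self, route):
        self.route = route      # vcpu the timer IRQ currently routes to
        self.kicked = None      # vcpu that received the IPI / VMEXIT
        self.pending_on = None  # vcpu on which the IRQ was actually set


def decoupled_expiry(guest, reroute_to=None):
    """Expiry only picks and kicks a vcpu; set_irq runs later, at VMENTER."""
    guest.kicked = guest.route
    if reroute_to is not None:
        # The guest reprograms its (IO)APIC inside the window.
        guest.route = reroute_to
    guest.pending_on = guest.route  # routing is evaluated late
    return guest.kicked == guest.pending_on


def coupled_expiry(guest, reroute_to=None):
    """set_irq runs in the expiry callback; the same target is then kicked."""
    target = guest.route
    guest.pending_on = target
    guest.kicked = target
    if reroute_to is not None:
        # A later routing change only affects the next tick.
        guest.route = reroute_to
    return guest.kicked == guest.pending_on


print(decoupled_expiry(ToyGuest(route=2), reroute_to=3))
print(coupled_expiry(ToyGuest(route=2), reroute_to=3))
```

In the decoupled case the model kicks vcpu 2 but the interrupt lands on vcpu 3, so the function returns False. In the coupled case the kicked vcpu and the injection target always agree.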
I'll also post a similar mail to kvm@ to gauge opinions there, but does anyone have any thoughts here?
Patch(es) available on kernel-2.6.32-42.el6
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.