Red Hat Bugzilla – Bug 507550
[RHEL5.4 KVM]: Instant reboot when kexec'ing on AMD
Last modified: 2014-03-25 20:58:30 EDT
Description of problem:
I'm running a RHEL-5.4 x86_64 guest under a RHEL-5.4 x86_64 kvm AMD host. When trying to kexec into a new kernel inside the guest, instead of booting the new kernel the guest actually instantly reboots. Under Intel, this doesn't happen (although kexec doesn't complete for other, unrelated reasons). I've partially tracked it down; when trying to do relocate_kernel inside the guest, at some point we try to fill the cr3 with the new temporary page tables for the new kernel:
Dump of assembler code for function relocate_new_kernel:
0xffffffff8006211f <relocate_new_kernel+0>: pushq $0x0
0xffffffff80062121 <relocate_new_kernel+2>: popfq
0xffffffff80062122 <relocate_new_kernel+3>: mov (%rsi),%r8
0xffffffff80062125 <relocate_new_kernel+6>: mov 0x80(%rsi),%rcx
0xffffffff8006212c <relocate_new_kernel+13>: mov 0x10(%rsi),%r9
0xffffffff80062130 <relocate_new_kernel+17>: mov %r9,%cr3
The reboot occurs right at the last instruction, the mov into %cr3. Looking at the host's dmesg from that moment, we see:
kvm: inject_page_fault: double fault 0xffff810037c97010
kvm: inject_page_fault: double fault 0xffff810037ca6010
So KVM is upset because it's trying to inject a page fault while an exception is already pending. I still need to track down which exception is already pending and why.
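To make the escalation concrete, here is a toy model in Python of a vCPU with a single pending-exception slot. It is only a sketch of the architectural rule (a page fault on top of a pending page fault becomes a double fault, and a fault on top of a double fault becomes a triple fault, which resets the machine). The class and method names are made up for illustration and this is not the KVM source; only the log message format is taken from the dmesg output above.

```python
PF_VECTOR = 14  # x86 page-fault vector
DF_VECTOR = 8   # x86 double-fault vector


class VcpuModel:
    """Toy model of one vCPU's pending-exception slot (hypothetical, not KVM code)."""

    def __init__(self):
        self.pending = None        # vector of the exception waiting to be delivered
        self.triple_fault = False  # True once the guest would be reset
        self.log = []              # messages the host would print

    def inject_page_fault(self, addr):
        """Queue a page fault, escalating if an exception is already pending."""
        if self.pending is None:
            # Nothing pending: the page fault is simply queued.
            self.pending = PF_VECTOR
        elif self.pending == PF_VECTOR:
            # Page fault while a page fault is pending: promote to double fault.
            self.log.append("kvm: inject_page_fault: double fault 0x%x" % addr)
            self.pending = DF_VECTOR
        elif self.pending == DF_VECTOR:
            # Fault while a double fault is pending: triple fault, guest resets.
            self.triple_fault = True
        # Other pending vectors are ignored in this simplified model.

    def delivered(self):
        """The guest took the pending exception; the slot is free again."""
        self.pending = None


if __name__ == "__main__":
    vcpu = VcpuModel()
    for _ in range(3):
        vcpu.inject_page_fault(0xffff810037c97010)
    print(vcpu.log)
    print("triple fault:", vcpu.triple_fault)
```

In this model the guest only reaches a triple fault if nothing is delivered between the three injections, which matches a CPU whose page tables have disappeared and which can no longer run its own fault handlers.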
Ug, this is actually worse than I thought. During a crash, we go through:
kernel/kexec.c:crash_kexec() -> arch/x86_64/kernel/crash.c:machine_crash_shutdown(). In machine_crash_shutdown(), one of the things we do is call nmi_shootdown_cpus(); this is supposed to deliver an NMI IPI to all of the other (non-crashing) CPUs in the system and basically make them spin in a loop. The problem is that this IPI doesn't seem to be getting delivered to the other CPUs *at all*, meaning that they are still running around doing other things, and when we go to switch out the page tables, they then fault, double fault, and triple fault trying to access their text pages (I think). So the next thing to find out is why no NMI IPIs are being delivered to these CPUs, even though they should be.
It keeps getting worse. The reason NMI IPIs aren't being delivered is that in RHEL-5, KVM on AMD has no NMI delivery support. None at all. What this means is that kdump in the guest kernel goes to deliver an NMI IPI, but the underlying KVM implementation more or less just discards it. So the other CPUs continue on their merry way, until the page tables get ripped out from under them and they triple fault.
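The effect of a dropped NMI IPI can be shown with a small deterministic model in Python. This is a sketch under stated assumptions, not guest or KVM code: `Cpu`, `send_nmi_allbutself`, `nmi_shootdown` and the `hypervisor_delivers_nmi` flag are hypothetical names chosen for illustration. The flag stands in for whether the hypervisor actually injects the NMI into the target vCPUs.

```python
class Cpu:
    """Toy CPU: either still running normal code or parked by the crash NMI."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        self.halted = False

    def handle_nmi(self):
        # Stand-in for the crash NMI callback: park this CPU in a spin loop.
        self.halted = True


def send_nmi_allbutself(cpus, sender, hypervisor_delivers_nmi):
    """Send an NMI IPI to every CPU except the sender."""
    for cpu in cpus:
        if cpu is sender:
            continue
        if hypervisor_delivers_nmi:
            cpu.handle_nmi()
        # Otherwise the IPI is silently dropped and the target never notices.


def nmi_shootdown(cpus, sender, hypervisor_delivers_nmi):
    """Try to park all other CPUs; return the ids of those still running."""
    send_nmi_allbutself(cpus, sender, hypervisor_delivers_nmi)
    return [c.cpu_id for c in cpus if c is not sender and not c.halted]


if __name__ == "__main__":
    for delivers in (True, False):
        cpus = [Cpu(i) for i in range(4)]
        still_running = nmi_shootdown(cpus, cpus[0], delivers)
        print("NMI delivered:", delivers, "still running:", still_running)
```

With delivery disabled, every CPU other than the crashing one is still running when the page tables are switched, which is the situation described above.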
Now, SVM NMI support has recently (April) been added to the upstream kernel. The problem is that it requires a rewrite of the IRQ injection, so a backport is not really possible. I'm going to look at essentially re-implementing that support in the RHEL-5 sources, to see if I can get something that works. Apparently NMI support is also required to pass some WHQL tests, so it will be a good thing to have working.
Hope it is not too late; this is a dangerous change at this stage.
Gleb, any comments, pointers?
(In reply to comment #3)
> Hope it is not too late; this is a dangerous change at this stage.
Right, but it is more-or-less required functionality.
> Gleb, any comments, pointers?
I talked to Gleb about this on IRC; basically, the patches that went upstream cannot be backported to RHEL-5.4 since they depend on the interrupt re-working. I've been looking at doing a different implementation of it, based on Gleb's work, but obviously fairly different. Once I have a patch, I'll send it to Gleb + company, and we can see if it is too dangerous and risky to take. Then we can decide on when and where to put it in.
Can QE test latest code in the Z stream? Gleb added NMI support so it should work.
I tried this with kvm-83-105.el5_4.9.
The guest kernel panics.
Could you have a look? What might be going wrong?
The oops also occurs on an Intel host (does this mean the original issue is gone?)
Created attachment 364138
Saved the log as an attachment.
This does not look like the original problem; with that one, an AMD guest would be reset immediately. I see that you are loading the original initramfs instead of reserving memory and using kdump. Can you try kdump to see if you hit the same problem? If so, that is something we probably need to fix. Ideally, try it both with a dump target specified in kdump.conf (copying vmcores from the kdump initramfs) and without one (copying vmcores from the kdump daemon by running init in the second kernel).
(In reply to comment #6)
> Hi Chris
> I tried this with kvm-83-105.el5_4.9.
> The guest kernel panics.
> Could you have a look? What might be going wrong?
> The oops also occurs on an Intel host (does this mean the original issue is
> gone?)
Previous to this fix, you wouldn't get nearly this far; as soon as you executed
the kexec command, the machine would reboot. That means that the issue listed
in this particular BZ is indeed fixed. Please open up a new BZ about the
secondary crash you are seeing, since that is something new (and should be
Note that if you do try Cai's suggestion on an SMP guest, you have a 50% chance of hitting another bug that I am working on, BZ 505527. To be absolutely certain, make sure that you test with a UP guest to avoid that bug.