Red Hat Bugzilla – Bug 483204
kvm.ko kvm_mmu_write_pte()/crash is_largepage_backed()
Last modified: 2009-02-03 12:48:54 EST
Created attachment 330443
Description of problem:
Just upgraded the machine to a recent kernel.
The machine is running a number of VMs (Mostly CentOS 5.2 machines).
After ~48 hours of work, KVM crashed.
Version-Release number of selected component (if applicable):
$ uname -a
Linux gilboa-home-srv 2.6.27.12-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
Unknown at this point.
AMD Athlon64X2 5000+.
4 x 320GB SATA in software RAID5.
nVidia binary driver.
Marcelo: here's another MMU oops:
BUG: unable to handle kernel paging request at ffffc2000491c808
IP: [<ffffffffa0a360a2>] is_largepage_backed+0x2b/0xa2 [kvm]
PGD 17fc0b067 PUD 17fc0c067 PMD 175bc2067 PTE 0
[<ffffffffa0a3734e>] kvm_mmu_pte_write+0x141/0x7e2 [kvm]
[<ffffffffa0a37a97>] ? paging32_walk_addr+0xa8/0x247 [kvm]
[<ffffffffa0a2cf10>] ? kvm_write_guest_page+0x57/0x6d [kvm]
[<ffffffffa0a3119e>] emulator_write_phys+0x37/0x47 [kvm]
Tried looking on kerneloops.org, but I think the search is broken right now - e.g. searching for kvm_mmu_pte_write() doesn't even show:
I'm not even looking at your problem until you confirm you're able to reproduce it without the Nvidia binary driver (or any other) loaded.
(In reply to comment #2)
> I'm not even looking at your problem until you confirm you're able to reproduce
> it without the Nvidia binary driver (or any other) loaded.
Good point, I didn't notice that.
Gilboa: please try again without the nVidia driver
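For anyone retesting, one way to confirm that a kernel is actually untainted before trying to reproduce is to decode the taint mask exposed in /proc/sys/kernel/tainted. The snippet below is only a minimal sketch: the helper names (`decode_taint`, `read_taint`) are made up for illustration, the table covers just a few of the flags, and the bit positions are taken from the current kernel documentation, so they are an assumption for a kernel as old as the one in this report.

```python
from pathlib import Path
from typing import List, Optional

# Small subset of taint flags (bit number -> letter, meaning).
# Assumption: bit positions as listed in current kernel documentation.
TAINT_BITS = {
    0: ("P", "proprietary module was loaded"),
    1: ("F", "module was force-loaded"),
    7: ("D", "kernel has oopsed or hit a BUG"),
    9: ("W", "kernel issued a warning"),
}


def decode_taint(mask: int) -> List[str]:
    """Return the letters of the known taint flags set in mask."""
    return [
        letter
        for bit, (letter, _meaning) in sorted(TAINT_BITS.items())
        if mask & (1 << bit)
    ]


def read_taint(path: str = "/proc/sys/kernel/tainted") -> Optional[int]:
    """Read the running kernel's taint mask, or None if the file is absent."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    mask = read_taint()
    if mask is None:
        print("taint mask not available on this system")
    elif mask == 0:
        print("kernel is not tainted")
    else:
        print("taint mask", mask, "known flags:", decode_taint(mask))
```

A mask of 0 means no taint flag is set. A mask with bit 0 set (flag P) is what loading a proprietary module such as the nVidia driver produces, and it stays set until reboot even if the module is unloaded later.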
Guys, I do appreciate the logic behind your request (though I doubt that my actions justified your "screw you - you nVidia users, you!" tone [@Glauber]), but,
I doubt that I'll be able to reproduce this problem any time soon. This is the first KVM crash I've had since... F8? (kernel-tainting-children-eating binary drivers aside)
Nevertheless, I'll try and migrate a couple of VMs to one of my quad-socket-servers (I'll replace RHEL5.3 w/ F10 just for you :)) and see if I can reproduce the problem.
(In reply to comment #4)
> Guys, I do appreciate the logic behind your request (though I doubt that my
> actions justified your "screw you - you nVidia users, you!" tone [@Glauber]),
You're right ... see bug #480779 for a slightly more diplomatic response:
"please try and reproduce without the nvidia driver loaded. We have no
ability to fix issues caused by a closed source kernel module."
The point does still stand, though - developers like Glauber and Marcelo are very keen to fix any bona fide KVM issues, but all developers have been bitten in the past by bizarre bugs caused by binary kernel modules. Hence the violent reaction against any tainted oops :-)
> I doubt that I'll be able to reproduce this problem any time soon. This is the
> first KVM crash I've had since... F8? (kernel-tainting-children-eating binary
> drivers aside)
> Nevertheless, I'll try and migrate a couple of VMs to one of my
> quad-socket-servers (I'll replace RHEL5.3 w/ F10 just for you :)) and see if I
> can reproduce the problem.
Thanks much for trying.
Closing this as WORKSFORME for now, but please do re-open if you reproduce.
Just as a datapoint, a similar oops was reproduced by Jan Kiszka, who also
had a tainted kernel (by madwifi):
I am sorry if my tone sounded aggressive to you. That was certainly not my intention, but it can happen in written-only communication. The point is, as Mark pointed out, we are in no position to even try fixing those bugs.
We have had several issues in the past in which just rebooting the kernel without the binary driver seemed to fix the problem, and as we don't have the source code to debug it, it's unfortunately a waste of time ;-(
Please note that it's not because we think a subset of users is less important than others (in this case, nvidia or any binary driver users), but rather a pure matter of technical inability due to lack of driver source code.
Being a kernel developer myself, I can appreciate how difficult (if not practically impossible) it is to reproduce a bug from a report that includes proprietary code - especially when both KVM and the nVidia driver are shifting pages around, and especially given that this bug hit once in ~2 years...
As I pointed out before, being a good Fedora citizen I reported this bug, even though I was well aware that this bug report can only be used as a data point / reference.
Anyway, neither the S7000/Xeon machine, nor the Athlon64 machine crashed thus far (knock wood).
Thanks again for taking the time to help,