Bug 483204 - kvm.ko kvm_mmu_write_pte()/crash is_largepage_backed()
Summary: kvm.ko kvm_mmu_write_pte()/crash is_largepage_backed()
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kvm
Version: 10
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Marcelo Tosatti
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: F11VirtTarget
Reported: 2009-01-30 07:23 UTC by Gilboa Davara
Modified: 2009-02-03 17:48 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-31 19:19:19 UTC
Type: ---
Embargoed:


Attachments
Crash log. (5.26 KB, text/plain)
2009-01-30 07:23 UTC, Gilboa Davara

Description Gilboa Davara 2009-01-30 07:23:19 UTC
Created attachment 330443
Crash log.

Description of problem:
Just upgraded the machine to a recent kernel.
The machine is running a number of VMs (mostly CentOS 5.2 machines).
After ~48 hours of work, KVM crashed.

Version-Release number of selected component (if applicable):
$ uname -a
Linux gilboa-home-srv 2.6.27.12-170.2.5.fc10.x86_64 #1 SMP Wed Jan 21 01:33:24 EST 2009 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:
Unknown at this point.


Additional info:
Hardware:
AMD Athlon64X2 5000+.
4GB.
4 x 320GB SATA in software RAID5.
nVidia binary driver.

Comment 1 Mark McLoughlin 2009-01-30 08:59:42 UTC
Marcelo: here's another MMU oops:

BUG: unable to handle kernel paging request at ffffc2000491c808
IP: [<ffffffffa0a360a2>] is_largepage_backed+0x2b/0xa2 [kvm]
PGD 17fc0b067 PUD 17fc0c067 PMD 175bc2067 PTE 0
...
Call Trace:
 [<ffffffffa0a3734e>] kvm_mmu_pte_write+0x141/0x7e2 [kvm]
 [<ffffffffa0a37a97>] ? paging32_walk_addr+0xa8/0x247 [kvm]
 [<ffffffffa0a2cf10>] ? kvm_write_guest_page+0x57/0x6d [kvm]
 [<ffffffffa0a3119e>] emulator_write_phys+0x37/0x47 [kvm]


Tried looking on kerneloops.org, but I think the search is broken right now - e.g. searching for kvm_mmu_pte_write() doesn't even show:

  http://www.kerneloops.org/oops.php?number=133016

Comment 2 Glauber Costa 2009-01-30 15:15:08 UTC
I'm not even looking at your problem until you confirm you're able to reproduce it without the Nvidia binary driver (or any other) loaded.

Comment 3 Mark McLoughlin 2009-01-30 15:50:22 UTC
(In reply to comment #2)
> I'm not even looking at your problem until you confirm you're able to reproduce
> it without the Nvidia binary driver (or any other) loaded.

Good point, I didn't notice that.

Gilboa: please try again without the nVidia driver

Comment 4 Gilboa Davara 2009-01-30 22:13:14 UTC
Guys, I do appreciate the logic behind your request (though I doubt that my actions justified your "screw you - you nVidia users, you!" tone [@Glauber]), but,
I doubt that I'll be able to reproduce this problem any time soon. This is the first KVM crash I've had since... F8? (kernel-tainting-children-eating binary drivers aside)
Nevertheless, I'll try and migrate a couple of VMs to one of my quad-socket-servers (I'll replace RHEL5.3 w/ F10 just for you :)) and see if I can reproduce the problem.

- Gilboa

Comment 5 Mark McLoughlin 2009-01-31 19:19:19 UTC
(In reply to comment #4)
> Guys, I do appreciate the logic behind your request (though I doubt that my
> actions justified your "screw you - you nVidia users, you!" tone [@Glauber]),

You're right ... see bug #480779 for a slightly more diplomatic response:

  "please try and reproduce without the nvidia driver loaded. We have no
   ability to fix issues caused by a closed source kernel module."

The point does still stand, though - developers like Glauber and Marcelo are very keen to fix any bona fide KVM issues, but all developers have been bitten in the past by bizarre bugs caused by binary kernel modules. Hence the violent reaction against any tainted oops :-)

> but,
> I doubt that I'll be able to reproduce this problem any time soon. This is the
> first KVM crash I've had since... F8? (kernel-tainting-children-eating binary
> drivers aside)
> Nevertheless, I'll try and migrate a couple of VMs to one of my
> quad-socket-servers (I'll replace RHEL5.3 w/ F10 just for you :)) and see if I
> can reproduce the problem.

Thanks much for trying.

Closing this as WORKSFORME for now, but please do re-open if you reproduce.
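
For readers unfamiliar with the term: a "tainted" kernel is one that has recorded, in a bitmask, events such as the loading of a proprietary module. The sketch below shows one way to decode that bitmask before filing an oops. The names decode_taint, read_taint and TAINT_BITS are hypothetical, and the bit meanings are taken from the current upstream kernel documentation; they are an assumption for the 2.6.27 kernel in this report, where some of these bits may not exist.

```python
from pathlib import Path

# Subset of taint bits as documented for current Linux kernels.
# Assumption: older kernels (such as 2.6.27) may define fewer bits.
TAINT_BITS = {
    0: "P: proprietary module was loaded",
    1: "F: module was force-loaded",
    7: "D: kernel has oopsed or hit a BUG",
    9: "W: kernel issued a warning",
    12: "O: out-of-tree module was loaded",
}


def decode_taint(value: int) -> list[str]:
    """Return descriptions of the known taint bits set in value."""
    return [desc for bit, desc in sorted(TAINT_BITS.items())
            if value & (1 << bit)]


def read_taint(path: str = "/proc/sys/kernel/tainted") -> int:
    """Read the taint bitmask of the running kernel (Linux only)."""
    return int(Path(path).read_text().strip())


# Typical use on a Linux host:
#     for line in decode_taint(read_taint()):
#         print(line)
```

A value of 0 means the kernel is untainted; any oops taken while bit 0 is set will usually be met with the request made in this thread, namely to reproduce without the binary module loaded.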

Comment 6 Marcelo Tosatti 2009-01-31 19:55:29 UTC
Just as a datapoint, a similar oops was reproduced by Jan Kiszka, who also
had a tainted kernel (by madwifi):

http://markmail.org/message/h3q22m5suspzjgrj

Comment 7 Glauber Costa 2009-02-03 13:39:55 UTC
I am sorry if my tone sounded aggressive to you. That was certainly not my intention, but it can happen in written-only communication. The point is, as Mark pointed out, that we are in no position to even try fixing those bugs.

We had several issues in the past in which just rebooting the kernel without the binary driver seemed to fix the problem, and as we don't have the source code to debug it, it's unfortunately a waste of time ;-(

Please note that this is not because we think a subset of users (in this case, nvidia or any other binary driver users) is less important than others, but rather a pure matter of technical inability due to the lack of driver source code.

Comment 8 Gilboa Davara 2009-02-03 17:48:54 UTC
Glauber,

Being a kernel developer myself, I can appreciate how difficult it is (if not practically impossible) to reproduce a bug report that includes proprietary code - especially when both KVM and the nVidia driver are shifting pages around, and especially given that this bug hit once in ~2 years...

As I pointed out before, being a good Fedora citizen I reported this bug, even though I was well aware that this bug report can only be used as a data point / reference.

Anyway, neither the S7000/Xeon machine nor the Athlon64 machine has crashed thus far (knock on wood).

Thanks again for taking the time to help,
- Gilboa

