Bug 492838 - kvm: PTHREAD_PRIO_INHERIT mutexes fail to unlock sometimes with 32 bit i686/PAE guest kernel, works on i686/non-PAE
Status: CLOSED RAWHIDE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: i386 Linux
Priority: high
Severity: high
Assigned To: Karen Noel
QA Contact: Fedora Extras Quality Assurance
Duplicates: 493801
Blocks: F11VirtTarget

Reported: 2009-03-30 06:20 EDT by Jens Petersen
Modified: 2013-01-10 00:07 EST
CC List: 19 users

Doc Type: Bug Fix
Last Closed: 2009-05-25 13:53:22 EDT

Attachments
canberra-gtk-play-bugreport.txt (230.69 KB, text/plain)
2009-03-30 06:21 EDT, Jens Petersen
Bug Buddy report with appropriate debuginfo packages installed (11.22 KB, text/plain)
2009-03-30 13:10 EDT, Julian Sikorski
Patch that makes the problem go away (367 bytes, patch)
2009-03-30 15:19 EDT, Julian Sikorski
Updated bug-buddy report (11.81 KB, text/plain)
2009-04-30 02:46 EDT, Julian Sikorski
Totem bug-buddy report (12.96 KB, text/plain)
2009-04-30 02:48 EDT, Julian Sikorski
Metacity backtrace (46 bytes, text/plain)
2009-04-30 17:03 EDT, Julian Sikorski
Metacity backtrace (15.80 KB, text/plain)
2009-04-30 17:04 EDT, Julian Sikorski
test program with two threads at different priorities (1.50 KB, text/plain)
2009-05-05 19:40 EDT, Chuck Ebbert
pulseaudio log (36.56 KB, text/plain)
2009-05-18 16:26 EDT, Julian Sikorski
host kernel fix (399 bytes, patch)
2009-05-22 10:53 EDT, Avi Kivity
patch to reload cr3 when cr4 is reloaded (501 bytes, patch)
2009-05-24 15:23 EDT, Avi Kivity
replacement for second patch (897 bytes, patch)
2009-05-25 05:57 EDT, Avi Kivity

Description Jens Petersen 2009-03-30 06:20:00 EDT
Description of problem:
canberra-gtk-play crashes at desktop startup, presumably trying to play the gnome login jingle.

Version-Release number of selected component (if applicable):
libcanberra-0.11-8.fc11

How reproducible:
every time

Steps to Reproduce:
1. boot F11 Beta and login to gdm
  
Actual results:
Bug Buddy dialog appears

Expected results:
jingle to play and no crash
Comment 1 Jens Petersen 2009-03-30 06:21:39 EDT
Created attachment 337199 [details]
canberra-gtk-play-bugreport.txt

this is the info saved by bug-buddy.

I am running inside qemu-kvm fwiw.
Comment 2 Lennart Poettering 2009-03-30 10:17:26 EDT
Uh. Strange issue. Any chance you can get me a stack trace?
Comment 3 Julian Sikorski 2009-03-30 10:30:53 EDT
I can try to obtain it provided it'll still be reproducible after pulling today's updates. There is also a shitload of selinux denials during logging in related to pulseaudio, so this could be related.
Comment 4 Julian Sikorski 2009-03-30 13:10:34 EDT
Created attachment 337232 [details]
Bug Buddy report with appropriate debuginfo packages installed

This happens with up-to-date rawhide as well. SELinux denials are gone, they must have been unrelated.
Comment 5 Lennart Poettering 2009-03-30 13:40:10 EDT
Given that both cases where this happened are in a VM, and I cannot make the slightest sense of this, I am tempted to say that this is a bug in kvm in some way. Maybe we should CC someone from the kvm folks?
Comment 6 Lennart Poettering 2009-03-30 14:12:01 EDT
Hmm, this is probably related to PTHREAD_PRIO_INHERIT in some way.
Comment 7 Lennart Poettering 2009-03-30 15:15:17 EDT
It seems that when PTHREAD_PRIO_INHERIT is set for a mutex sometimes pthread_mutex_unlock() fails for no apparent reason. 

Changing PA to not use PTHREAD_PRIO_INHERIT seems to make the problem go away.

Reassigning to kernel.
Comment 8 Julian Sikorski 2009-03-30 15:19:50 EDT
Created attachment 337239 [details]
Patch that makes the problem go away

Pulseaudio packages for i386 with the said patch included are available here:
http://belegdol.fedorapeople.org/pulse
They're i386 instead of i586 since I didn't know how to build i586 ones on my Fedora 10 x86_64, but I hope it does not matter here.
Comment 9 Chuck Ebbert 2009-03-31 00:40:38 EDT
So this is a pulseaudio bug then?
Comment 10 Julian Sikorski 2009-03-31 04:28:29 EDT
This patch only disables use of PTHREAD_PRIO_INHERIT. Lennart also suspected some problems with kvm, as both Jens and I are seeing the issue inside a kvm virtual machine.
Comment 11 Lennart Poettering 2009-03-31 07:19:44 EDT
(In reply to comment #9)
> So this is a pulseaudio bug then?  

No, I am pretty sure this is unrelated to PA. That's why I reassigned this to the kernel. Might be a bug in KVM, otherwise in the kernel or in nptl. No clue.
Comment 12 Jens Petersen 2009-04-06 02:47:55 EDT
A partial "workaround" is to remove bug-buddy ;) - that at least stops metacity crashes locking up the desktop every time there is a desktop sound event or beep.
Comment 13 Jens Petersen 2009-04-21 03:39:41 EDT
I am not sure if the severity of this bug is appreciated - it basically means that fedora 11 desktop will not run on qemu/kvm.
Comment 14 Julian Sikorski 2009-04-28 17:59:35 EDT
For reference, this still happens with a fresh install of preview release.
Comment 15 Chuck Ebbert 2009-04-28 21:19:17 EDT
Is there a simple test program available that demonstrates this problem?
Comment 16 Julian Sikorski 2009-04-29 02:07:25 EDT
totem dies on startup as well, but I'm not sure if for the same reason.
Comment 17 Chuck Ebbert 2009-04-30 00:50:57 EDT
There isn't going to be any progress on this bug without a test program that can be used to reproduce the problem.
Comment 18 Julian Sikorski 2009-04-30 02:46:35 EDT
Created attachment 341873 [details]
Updated bug-buddy report

This happens as a result of running:
/usr/bin/canberra-gtk-play --id="desktop-login" --description="GNOME Login"
Comment 19 Julian Sikorski 2009-04-30 02:48:00 EDT
Created attachment 341874 [details]
Totem bug-buddy report

And this one is just for starting totem. I know these aren't dedicated test cases, but at least the problem is reproducible 100% of the time.
Comment 20 Jens Petersen 2009-04-30 03:06:55 EDT
I guess pressing Tab inside a terminal is not good enough.

How to echo BELL inside a shell or from C?
Comment 21 Kyle McMartin 2009-04-30 10:32:50 EDT
I'm not convinced. What hardware is this on? Are you using VT or SVM?
Comment 22 Julian Sikorski 2009-04-30 10:52:37 EDT
You mean the host? Core 2 Duo T7200.
Comment 23 Justin M. Forbes 2009-04-30 14:21:41 EDT
I cannot reproduce this using current rawhide for both guest and host on a Core 2 Q6600 based system.  Is this still an issue with all updates?
Comment 24 Julian Sikorski 2009-04-30 14:29:12 EDT
It is, but I'm using F-10 as a host.
Comment 25 Julian Sikorski 2009-04-30 17:03:20 EDT
Created attachment 342006 [details]
Metacity backtrace

Turns out that this bug can cause metacity to lock up as well:
http://thread.gmane.org/gmane.linux.redhat.fedora.devel/111520
Comment 26 Julian Sikorski 2009-04-30 17:04:37 EDT
Created attachment 342008 [details]
Metacity backtrace
Comment 27 Kyle McMartin 2009-04-30 17:14:34 EDT
What kernel version on the F-10 host? Does 2.6.29 make it go away?
Comment 28 Julian Sikorski 2009-04-30 17:52:13 EDT
kernel-2.6.27.21-170.2.56.fc10.x86_64. kernel-2.6.29.1-42.fc10.x86_64 from updates-testing does not help.
Comment 29 Kyle McMartin 2009-04-30 18:27:23 EDT
Sigh, given this doesn't appear to happen with F-11, I'm going to go out on a limb and say it's not the kernel. The changes between the F-10 and F-11 2.6.29 kernels are minimal.

Mark, Avi, any thoughts what the cause could be?
Comment 30 Julian Sikorski 2009-04-30 18:34:11 EDT
My wild guess is that it could be kvm.
Comment 31 Jens Petersen 2009-04-30 18:57:46 EDT
(In reply to comment #29)
> Sigh, given this doesn't appear to happen with F-11

Erm, who said it is not happening with F11: I can reproduce with both F10 and F11 hosts.
Comment 32 Jens Petersen 2009-04-30 18:59:45 EDT
(In reply to comment #23)
> I cannot reproduce this using current rawhide for both guest and host on a Core
> 2 Q6600 based system.  Is this still an issue with all updates?  

Well I can: rawhide-i386 host and guest on a Dell Precision 390.
Comment 33 Kyle McMartin 2009-04-30 19:19:27 EDT
I presume Justin's box is 64-bit, sigh. 32-bit specific? Nice.
Comment 34 Kyle McMartin 2009-04-30 20:02:02 EDT
http://kyle.fedorapeople.org/pthread_prio_inherit_test.c

Can you try running this in your guest?

Should build with:

kyle@ihatethathostname ~ $ gcc -D_XOPEN_SOURCE=500 -lpthread -o pthread_prio_inherit_test pthread_prio_inherit_test.c


Thanks, Kyle
Comment 35 Jens Petersen 2009-04-30 21:44:51 EDT
(In reply to comment #34)
> http://kyle.fedorapeople.org/pthread_prio_inherit_test.c

Yes, runs fine I think:

$ ./pthread_prio_inherit_test 
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
lock... acquired... released...
:
Comment 36 Jens Petersen 2009-04-30 21:46:38 EDT
A few more data points:

I can't reproduce with rawhide-i386 live image guests FWIW.

I have reproduced on F-10 x86_64 and F-11 i386 hosts (both with i386 guest).

I will try testing a 64bit guest later.
Comment 37 Jens Petersen 2009-04-30 22:11:17 EDT
I find it very easy to reproduce now (I mean even after removing bug-buddy;) by just pressing Tab a few times in gnome-terminal and I see metacity restart which causes the gnome-terminal to lose focus.  (But I guess if one doesn't get the initial metacity lockup on a fresh guest install say then one can't reproduce.)
Comment 38 Jens Petersen 2009-05-05 03:44:58 EDT
(In reply to comment #36)
> I will try testing a 64bit guest later.  

Anaconda has been preventing me from doing this yet unfortunately...
Comment 39 Chuck Ebbert 2009-05-05 19:40:43 EDT
Created attachment 342559 [details]
test program with two threads at different priorities

This might be a more realistic test program. Compile the same as the previous one: gcc -o pmutex -O2 -lpthread -D_XOPEN_SOURCE=500 pmutex.c
Comment 40 Jens Petersen 2009-05-06 02:52:28 EDT
Thanks, pmutex.c also seems fine:

:
parent acquired...child acquired...released
released
parent acquired...child acquired...released
released
parent acquired...child acquired...released
released
parent acquired...child acquired...released
released
parent acquired...child acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
child acquired...parent acquired...released
released
:
Comment 41 Jens Petersen 2009-05-06 03:34:24 EDT
Okay - finally got an x86_64 rawhide guest and I can't reproduce any crashes there yet.  So marking this bug as i386 arch.
Comment 42 Jens Petersen 2009-05-06 03:36:25 EDT
I should probably have added I can still reproduce this on i386 rawhide guest.
Comment 43 Jens Petersen 2009-05-06 20:14:16 EDT
*** Bug 493801 has been marked as a duplicate of this bug. ***
Comment 44 Chuck Ebbert 2009-05-06 23:16:57 EDT
Is this happening on both Intel and AMD processors or just one flavor?

(In reply to comment #40)
> Thanks, pmutex.c also seems fine:
> 
> :
> parent acquired...child acquired...released
> released
> parent acquired...child acquired...released
> released

Strange, when I run it I get:

parent acquired...released
child acquired...released
parent acquired...released
child acquired...released
parent acquired...released
child acquired...released
Comment 45 Mark McLoughlin 2009-05-07 09:38:41 EDT
okay, just reproduced this in a 32 bit KVM guest on a 64 bit host (all latest rawhide)
Comment 46 Mark McLoughlin 2009-05-07 10:30:05 EDT
this happens with 32 bit SMP and UP guests

chatting to avi on IRC, he suspects it might be related to a mismatch of cpuid features in the host and guest:

host: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
guest: fpu de pse tsc msr pae mce cx8 apic pge cmov pat mmx fxsr sse sse2 up pni hypervisor

cx16 is missing in the guest
Comment 47 Kyle McMartin 2009-05-07 10:40:12 EDT
This sounds awfully likely (cx16 being cmpxchg16b) although, I can't imagine how... glibc's libpthread doesn't use cmpxchg16b as near as I can tell, nor does qemu-kvm (and, indeed, it only appears to be emulated when building qemu for target x86_64?)

cheers, Kyle
Comment 48 Jens Petersen 2009-05-07 23:12:49 EDT
(In reply to comment #44)
> Is this happening on both Intel and AMD processors or just one flavor?

I have only tested on Intel (and don't have any AMD currently).

(In reply to comment #46)
> cx16 is missing in the guest  

So that changed between F10 and F11?  (Nothing to do with move to kernel-PAE.i686 in f11?)
Comment 49 Kyle McMartin 2009-05-08 10:01:45 EDT
There's a thought... does using kernel.i586 in F-11 help at all?
Comment 50 Julian Sikorski 2009-05-08 12:17:45 EDT
Seems like it does. With 2.6.29.2-126.fc11.i586 the problem seems to go away, while it still happens with the corresponding PAE kernel. I'm running F-10 x86_64 as a host if that matters.
Comment 51 Rolf Fokkens 2009-05-09 10:22:18 EDT
Same here: host is x86_64 f10, with the non-PAE kernel 2.6.29.2-126.fc11.i586 the problem is gone. With the PAE kernel same problem here.

Physical CPU is AMD 4050e.
Comment 52 Mark McLoughlin 2009-05-11 07:38:05 EDT
Well, now - that narrows things down a bit
Comment 53 Mark McLoughlin 2009-05-11 09:24:45 EDT
Important to note the non-PAE kernel is i586, so it's i586/non-PAE vs. i686/PAE
Comment 54 Mark McLoughlin 2009-05-11 09:25:46 EDT
Obviously, if someone had time to build and try with an i686/non-PAE kernel, that would help a lot
Comment 55 Mark McLoughlin 2009-05-11 09:56:26 EDT
Suggestions from avi:

  - strace the failing tasks, look for errors on the futex ops
  - try playing with the clocksource
Comment 56 Kyle McMartin 2009-05-11 11:56:32 EDT
Mark,

I'll do a build and update this bug with a link to the rpms.
Comment 57 Kyle McMartin 2009-05-11 12:13:18 EDT
http://koji.fedoraproject.org/koji/taskinfo?taskID=1348053
^- is the scratch build. Should be cooked in an hour or so.
Comment 58 Julian Sikorski 2009-05-11 16:27:45 EDT
The kernel linked to in comment #57 seems to work correctly.
Comment 59 Kyle McMartin 2009-05-11 18:05:59 EDT
Well shit... I wonder how long this has been broken, does i686-PAE F-10 kernels work? What about the vanilla i686 flavour on F-10? (We only killed the separate i686 flavour for F-11...)

My guess is this is a kvm bug though. :/
Comment 60 Jens Petersen 2009-05-18 06:19:06 EDT
Sorry for the silence:

(In reply to comment #59)
> does i686-PAE F-10 kernels work? 

Naively I tried testing f10 kernel-PAE's on f11 but the boot hangs loading syslog...

I will try kernel-PAE on an f10 guest tomorrow.

> What about the vanilla i686 flavour on F-10?

I am pretty sure that works ok, as does kernel.i586.
Comment 61 Mark McLoughlin 2009-05-18 10:54:10 EDT
Avi suggests this futex fix might help:

  http://lkml.org/lkml/2009/5/18/225
Comment 62 Kyle McMartin 2009-05-18 13:34:02 EDT
http://koji.fedoraproject.org/koji/taskinfo?taskID=1361292


please try this scratch build which contains markmc's fix.
Comment 63 Mark McLoughlin 2009-05-18 15:56:02 EDT
(In reply to comment #62)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=1361292
> 
> please try this scratch build which contains markmc's fix.  

Thanks Kyle; doesn't seem to help, though
Comment 64 Julian Sikorski 2009-05-18 16:13:00 EDT
Yeah, the kernel linked to in comment #62 does not help with this problem.
Comment 65 Julian Sikorski 2009-05-18 16:21:56 EDT
Also, it seems there are some issues with PA (?) when using plain i586 kernel anyway. It is impossible to e.g. make rhythmbox play an audio file, it'll loop the first few dozen milliseconds infinitely. With PAE kernel, though, any app trying to play sound through PA will crash, so this might be unrelated. pulseaudio -vvvv says something about a possible alsa bug.
Comment 66 Julian Sikorski 2009-05-18 16:26:18 EDT
Created attachment 344519 [details]
pulseaudio log

Output of pulseaudio -vvvv running for a few moments.
Lennart, is this related or rather a separate issue?
Comment 67 Julian Sikorski 2009-05-18 16:41:17 EDT
Answering to myself: it seems like these issues are unrelated, bugs #475236 and #497392 have more info.
Comment 68 Kyle McMartin 2009-05-18 16:54:17 EDT
As Linus points out in the comments on that patch, it's... crap. A better one from Thomas Gleixner is at: http://lkml.org/lkml/diff/2009/5/18/370/1

Please try a new scratch build at:
http://koji.fedoraproject.org/koji/taskinfo?taskID=1361745

(As an aside, I really hope this fixes it, otherwise there's a whole slew of futex fixes to try and backport.)
Comment 69 Kyle McMartin 2009-05-18 16:55:40 EDT
While I'm at it, can you guys try the 2.6.30-rc$n kernels as they come out in dist-f12? That would help narrow this down once that patch from tglx gets upstream, if it isn't the culprit.
Comment 70 Jens Petersen 2009-05-18 19:57:55 EDT
(Ah btw I worked out why Live is ok - of course it is using kernel.i586!;)
Comment 71 Jens Petersen 2009-05-18 20:06:54 EDT
(In reply to comment #68)
> Please try a new scratch build at:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=1361745

Same for me - I still see metacity crashing (after removing bug-buddy).
Comment 72 Jens Petersen 2009-05-18 20:27:34 EDT
I tried kernel-2.6.30-0.81.rc5.git1.fc12 too and it seems to crash/lockup for me even more.

gdm login locks up quite soon and gnome-terminal immediately.  Rhythmbox crashed as soon as I played something.
Comment 73 Jens Petersen 2009-05-18 21:05:41 EDT
I am just noting (and also wondering why) this bug was removed from the blocker list.
Comment 74 Jens Petersen 2009-05-18 21:31:50 EDT
(In reply to comment #60)
> I will try kernel-PAE on an f10 guest tomorrow.

I tested kernel-PAE-2.6.27.21-170.2.56.fc10 guest without any problems - looks fine to me.
Comment 75 Kyle McMartin 2009-05-18 21:34:54 EDT
Reassigning to kvm.
Comment 76 Kyle McMartin 2009-05-18 22:34:53 EDT
Er, sorry, perhaps there's been some confusion. The kernels I posted through this have been for testing as guests, not hosts. Since that would be where the problem would be given it's hanging in futex code, and not somewhere else.

That said, I've got another build which attempts to disable the feature bit for PAE, who knows if it will help... but it should be booted as the host kernel.
http://koji.fedoraproject.org/koji/taskinfo?taskID=1362223
Comment 77 Jens Petersen 2009-05-19 00:42:11 EDT
(In reply to comment #76)
> The kernels I posted through this have been for testing as guests, not hosts.

No confusion - all my results above today are for PAE guests on f10 x86_64 host.
Comment 78 Mark McLoughlin 2009-05-19 02:46:42 EDT
(In reply to comment #75)
> Reassigning to kvm.  

Um, why?

(The kvm package doesn't even exist in F-11)
Comment 79 Julian Sikorski 2009-05-19 03:23:36 EDT
The kernels from comment #76 do not help either.
Comment 80 Mark McLoughlin 2009-05-21 13:12:03 EDT
(In reply to comment #73)
> I am just noting (and also wondering why) this bug was removed from the blocker
> list.  

Fair question, the reasoning is:

  1) It doesn't affect anaconda installs; they use an i586 kernel

  2) It doesn't affect live installs; they also use an i586 kernel

  3) So, this only affects people trying to use the desktop in a 32 bit
     KVM guest. Not a large enough class of users to block the release
     and the workaround is to replace kernel-PAE.i686 with kernel.i586

  4) We aren't remotely close to figuring out what the problem here is,
     so we'd be talking about delaying the release indefinitely

That's not to say this isn't a very serious bug. It certainly is.
Comment 81 Avi Kivity 2009-05-21 17:33:00 EDT
Tested the F10 PAE kernel, it is broken.  Jens, you reported it works.  Can you retest?
Comment 82 Avi Kivity 2009-05-21 19:58:23 EDT
It's a shadow mmu problem.  futex_init() dereferences a NULL pointer, expecting it to fault, but it doesn't.  This disabled most futex ops.
Comment 83 Jens Petersen 2009-05-21 21:34:23 EDT
(In reply to comment #81)
> Tested the F10 PAE kernel, it is broken.
> Jens, you reported it works.  Can you retest?  

Hmm, what is the correct way to test? :)

I am running the latest f10 kernel-PAE-2.6.27.21-170.2.56.fc10 and don't see metacity crash when I tab complete in gnome-terminal, but I guess there is a more technically correct way to test. ;)
Comment 84 Avi Kivity 2009-05-22 03:22:52 EDT
I just ran that kernel and canberra-gtk-play crashed on me.  For example 'canberra-gtk-play -i 0' should crash.
Comment 85 Jens Petersen 2009-05-22 04:42:59 EDT
Hmm dunno, for me I hear the login sound theme jingle when I start my desktop session.
Comment 86 Avi Kivity 2009-05-22 09:51:54 EDT
looks like kvm_flush_tlb() is the culprit.
Comment 87 Avi Kivity 2009-05-22 10:48:43 EDT
Okay,  the kernel is originally mapped at low addresses, and then moved to PAGE_OFFSET.  While this is done pdpte[0] == pdpte[3] in order to have identical mappings.

Later, the kernel drops pdpte[0] to unmap low addresses and tells the cpu by flushing the tlb.  However, the kvm paravirt tlb flush doesn't check pdptrs (they aren't really part of the tlb, but are reloaded as a side effect of the mov cr3 instruction).  So the low addresses remain mapped, and the futex test fails.
Comment 88 Avi Kivity 2009-05-22 10:53:56 EDT
Created attachment 345100 [details]
host kernel fix

Please test the attached patch.  Apply to host kernel!
Comment 89 Kyle McMartin 2009-05-22 11:15:37 EDT
http://koji.fedoraproject.org/koji/taskinfo?taskID=1370732

please test the scratch build found here.

Thanks!
 Kyle
Comment 90 Julian Sikorski 2009-05-22 11:57:44 EDT
Will this kernel work on F-10 host? If not, could you please provide a patched F-10 kernel as well?
Comment 91 Kyle McMartin 2009-05-22 14:34:22 EDT
It should, yes. Just remove it afterwards to get back on the 2.6.27 track.
Comment 92 Julian Sikorski 2009-05-22 18:47:05 EDT
The issue is still present with F-10 host running the kernel from comment #89 and guest running the PAE kernel.
Comment 93 Avi Kivity 2009-05-23 05:44:12 EDT
It worked for me.

Please provide:
 - host uname -a
 - guest uname -a
 - host /proc/cpuinfo
 - what you are doing to test, exactly
Comment 94 Julian Sikorski 2009-05-24 04:38:55 EDT
(In reply to comment #93)
> It worked for me.
> 
> Please provide:
>  - host uname -a
Linux snowball 2.6.29.3-157.bz492838.fc11.x86_64 #1 SMP Fri May 22 11:35:33 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

>  - guest uname -a
Linux localhost.localdomain 2.6.29.3-155.fc11.i686.PAE #1 SMP Wed May 20 17:31:09 EDT 2009 i686 i686 i386 GNU/Linux 

>  - host /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3990.58
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 CPU         T7200  @ 2.00GHz
stepping	: 6
cpu MHz		: 1000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips	: 3989.82
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:


>  - what you are doing to test, exactly  
Three things:
- canberra-gtk-play crashes at log-in
- totem crashes at startup
- rhythmbox crashes when trying to play a file
Comment 95 Julian Sikorski 2009-05-24 04:44:17 EDT
Avi, are you testing with F11 host? Maybe the newer kvm there has an influence on this problem?
Comment 96 Avi Kivity 2009-05-24 12:59:46 EDT
Ok, I tested 2.6.30 (which worked) and the wrong F11 guest kernel (which also worked).

So there's an additional bug in there.
Comment 97 Avi Kivity 2009-05-24 15:22:02 EDT
There is; we need to reload the PDPTEs when cr4 is reloaded.
Comment 98 Avi Kivity 2009-05-24 15:23:57 EDT
Created attachment 345263 [details]
patch to reload cr3 when cr4 is reloaded

additional host kernel fix attached.

Kyle, please spin a new test kernel with this patch in addition to the previous one.
Comment 99 Avi Kivity 2009-05-24 15:35:32 EDT
I hate this bug.
Comment 100 Kyle McMartin 2009-05-24 18:21:30 EDT
http://koji.fedoraproject.org/koji/taskinfo?taskID=1375088

please test the new kernel available here.

Avi, the diff didn't apply as kvm_mmu_reset_context is split in git head... I hope the patch is still correct...?
Comment 101 Avi Kivity 2009-05-25 01:19:53 EDT
The patch is still correct (assuming it still adds the new lines to the end of kvm_set_cr4()).
Comment 102 Jens Petersen 2009-05-25 03:53:22 EDT
(In reply to comment #100)
> http://koji.fedoraproject.org/koji/taskinfo?taskID=1375088

Aha that seems to fix it for me! :)

I tested rawhide-i386 guest with above kernel.i586 host and haven't seen any sound crashes yet.

Maybe someone else can also confirm?
Comment 103 Julian Sikorski 2009-05-25 05:30:22 EDT
Looks like it works with F-10 x86_64 host as well. Congrats, Avi.
Comment 104 Avi Kivity 2009-05-25 05:57:14 EDT
Created attachment 345298 [details]
replacement for second patch

Attached patch replaces my previous second patch.  Should be functionally identical but adheres more closely to the spec.
Comment 105 Kyle McMartin 2009-05-25 09:20:10 EDT
http://koji.fedoraproject.org/koji/taskinfo?taskID=1375672

new scratch build with the replacement to the second patch. Let me know if this is the one we want to put in F-10/F-11.

cheers, Kyle
Comment 106 Avi Kivity 2009-05-25 12:47:33 EDT
It is.  Marcelo reviewed it, and I am going to upstream it shortly.

Thank you for flying this bug report, we hope you will select us again for your next crash.
Comment 107 Kyle McMartin 2009-05-25 13:53:22 EDT
[fwiw, I finally figured out that I did have a VT machine and could reproduce this. Confirmed it fixes it for me as well.]

Great, I've committed this for F-11, hopefully it's not too late to tag for release.

thanks for your help Avi, Jens, Julian.
