Bug 212625

Summary: QEMU always crashes
Product: Red Hat Enterprise Linux 5 Reporter: Stephen Tweedie <sct>
Component: kernel-xen Assignee: Don Zickus <dzickus>
Status: CLOSED CURRENTRELEASE QA Contact:
Severity: medium Docs Contact:
Priority: medium    
Version: 5.0 CC: riel, xen-maint
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: 5.0.0 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2006-11-28 21:31:11 UTC Type: ---
Bug Depends On: 210422    
Bug Blocks:    

Description Stephen Tweedie 2006-10-27 19:14:21 UTC
+++ This bug was initially created as a clone of Bug #210422 +++

Description of problem:
Started running kernel-xen (in Domain-0) and QEMU no longer works.
No kqemu is used; qemu runs entirely as a non-privileged user, as a completely
regular process.
qemu run in a Xen domain on the same host, with kernel-2.6.16 built from
linux-2.6-xen.hg, works.
Both Domain-0 and the Xen domain run RawHide.i386.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-1.2747.fc6.i686
xen-3.0.2-44.i386
qemu-0.8.2-3.fc6.i386

SDL-1.2.10-6.2.i386
alsa-lib-1.0.12-2.fc6.i386
glibc-2.5-3.i686
libX11-1.0.3-4.fc6.i386
libXau-1.0.1-3.1.i386
libXcursor-1.1.7-1.1.i386
libXdmcp-1.0.1-2.1.i386
libXext-1.0.1-2.1.i386
libXfixes-4.0.1-2.1.i386
libXrandr-1.1.1-3.1.i386
libXrender-0.9.1-3.1.i386

How reproducible:
Always.

Steps to Reproduce:
1. qemu -cdrom /dev/zero -net none -m 1
  
Actual results:
Could not open '/dev/kqemu' - QEMU acceleration layer not activated
[segv]

Expected results:
Could not open '/dev/kqemu' - QEMU acceleration layer not activated
[displayed window containing Bochs BIOS screen with failed boot]

Additional info:
A core file etc. is available upon request, but you should be able to reproduce
it easily yourself. I am not fully certain it is Xen-specific, but I use QEMU
fairly often and it worked the last time I ran it on a non-Xen kernel.

Program terminated with signal 11, Segmentation fault.
#0  cpu_x86_exec (env1=0x9d70998) at /usr/src/debug/qemu-0.8.2/cpu-exec.c:772
772                    gen_func();
(gdb) bt
#0  cpu_x86_exec (env1=0x9d70998) at /usr/src/debug/qemu-0.8.2/cpu-exec.c:772
#1  0x08050968 in main_loop () at /usr/src/debug/qemu-0.8.2/vl.c:5069
#2  0x08051de3 in main (argc=1536, argv=0x0) at /usr/src/debug/qemu-0.8.2/vl.c:6221
Previous frame inner to this frame (corrupt stack?)

-- Additional comment from srostedt on 2006-10-16 21:55 EST --
I just tried this with

kernel-xen-2.6.18-1.2784.fc6
xen-3.0.2-44
qemu-0.8.2-3.fc6

And it worked for me.  Could you verify that the latest kernel-xen fixes this
problem?


-- Additional comment from jkratoch on 2006-10-17 14:09 EST --
Created an attachment (id=138700)
qemu -cdrom /dev/zero -net none -m 1

kernel-xen-2.6.18-1.2798.fc6.i686
xen-3.0.2-45.el5.i386
qemu-0.8.2-3.fc6.i386

Too bad you could not reproduce it.  Are you really running i386 (32-bit)?


-- Additional comment from jkratoch on 2006-10-19 14:00 EST --
It can be worked around with
  echo 0 >/proc/sys/kernel/exec-shield
(still on that kernel-xen-2.6.18-1.2798.fc6.i686)
as suggested by Caolan McNamara in Bug 210748. I am still not aware of the
specific cause, but I assume you already know it.
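
Before and after applying the workaround it can help to confirm what the
sysctl is actually set to. The following is only a minimal sketch in Python:
the path is the one quoted above, while the helper name read_exec_shield is
hypothetical and introduced purely for illustration. It returns None on
kernels that do not expose the file at all.

```python
from pathlib import Path
from typing import Optional

# Path taken from the workaround above; it exists only on kernels
# that carry the exec-shield patch.
EXEC_SHIELD_PATH = Path("/proc/sys/kernel/exec-shield")


def read_exec_shield(path: Path = EXEC_SHIELD_PATH) -> Optional[int]:
    """Return the current exec-shield setting, or None if it cannot be read.

    Hypothetical helper for illustration only.
    """
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None


if __name__ == "__main__":
    value = read_exec_shield()
    if value is None:
        print("exec-shield sysctl not available on this kernel")
    else:
        print(f"exec-shield = {value}")
```

A value of 0 would indicate that the workaround is in effect.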


-- Additional comment from srostedt on 2006-10-20 21:55 EST --
No, I didn't notice that this was for i386 only. You did mention that you were
using that, but I wasn't. Running i386, I was able to get it to segfault.  OK,
now that I have something that doesn't work, I can take a closer look at it.  I
also switched this BZ to state that this is not for all hardware, but for i686.

-- Additional comment from srostedt on 2006-10-24 12:19 EST --
The fix for bz 200382 seems to have caused this bug. Will look into it further.

-- Additional comment from srostedt on 2006-10-25 10:26 EST --
OK, I've confirmed that the fix for 200382 caused this problem. I have a patch
that has already been submitted to the maintainers.  But I must first confirm
that the patch doesn't break 200382 before I close this.

Comment 1 RHEL Program Management 2006-10-30 17:20:24 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 Jay Turner 2006-10-31 12:28:19 UTC
QE ack for RHEL5B2.

Comment 3 Brian Stein 2006-10-31 14:24:47 UTC
*** Bug 212588 has been marked as a duplicate of this bug. ***

Comment 4 Stephen Tweedie 2006-11-01 13:51:01 UTC
OK, after a lot of testing last night, I've got some results:

First of all, I can no longer reproduce the bug reported in #200382.  Even using
that same pre-fc6 rawhide kernel, with a recompiled hypervisor to get past the
flood of timer messages that the original 1.2439 kernel+HV swamps the console
with, everything works fine.  I really don't know what precisely was the trigger
for hitting the exec_limit:=~0UL path, but I can't reproduce it now, although I
can definitely trigger the execshield GPF path in general.

But I can force the exec_limit:=~0UL path by setting limit to -1 whenever the
execshield GPF path is taken.  Doing so, with the #200382 patch otherwise
completely reverted, I can reproduce exactly the GPF exec_limit=0xffffffff
fixups that used to cause problems:

#GPF fixup (0[seg:0]) at 00110918, CPU#1.
 exec_limit: ffffffff, user_cs: 0000ffff/00cffb00, CPU_cs: 000004f4/00c0fb00.

and this succeeds just fine, on a RHEL-5 kernel+hypervisor.

Furthermore, on the exact GPF infinite loops that we were getting before:

#GPF fixup (0[seg:0]) at 080c76e1, CPU#0.
 exec_limit: ffffffff, user_cs: 0000ffff/00cffb00, CPU_cs: 000067ff/00cffb00.

we were trying to change the limit to 0xfffff from 0xf67ff, with all other bits
of current and intended CS the same; so even the new proposed patch to test only
the limit bits wouldn't actually make the slightest difference to the problem
that was happening in bug 200382.
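
The limit values quoted above can be checked by decoding the descriptor words
from the "#GPF fixup" message. The sketch below is in Python and assumes each
pair is printed as low word/high word of a standard x86 segment descriptor
(limit bits 15:0 in the low word, limit bits 19:16 in the high word). The
function names are hypothetical and exist only for this illustration.

```python
# Masks follow the standard x86 segment descriptor layout.
LIMIT_LOW_MASK = 0x0000FFFF   # limit bits 15:0 live in the low word
LIMIT_HIGH_MASK = 0x000F0000  # limit bits 19:16 live in the high word
WORD_MASK = 0xFFFFFFFF


def descriptor_limit(low: int, high: int) -> int:
    """Return the 20-bit segment limit encoded in a descriptor."""
    return (low & LIMIT_LOW_MASK) | (high & LIMIT_HIGH_MASK)


def non_limit_bits(low: int, high: int) -> tuple:
    """Return both descriptor words with the limit field masked out."""
    return (low & ~LIMIT_LOW_MASK & WORD_MASK,
            high & ~LIMIT_HIGH_MASK & WORD_MASK)


# Values copied from the second "#GPF fixup" message above,
# assumed to be in low/high order.
user_cs = (0x0000FFFF, 0x00CFFB00)
cpu_cs = (0x000067FF, 0x00CFFB00)

print(hex(descriptor_limit(*user_cs)))                       # 0xfffff
print(hex(descriptor_limit(*cpu_cs)))                        # 0xf67ff
print(non_limit_bits(*user_cs) == non_limit_bits(*cpu_cs))   # True
```

Under that assumption the two descriptors differ only in their limit fields,
which is why a comparison restricted to the limit bits would still see a
difference and behave the same way as the full comparison.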

In short, I think we need to simply remove the
linux-2.6-xen-execshield-lazy-exec-limit.patch entirely --- the fixed form
cannot possibly fix the problem that it was initially generated for, and I have
tested that the unpatched RHEL-5 kernel is just as effective as the "fixed" one
at not crashing qemu/mono etc.

Comment 5 Suzanne Logcher 2006-11-01 19:44:05 UTC
Please change bugzilla status to POST once the removal of
linux-2.6-xen-execshield-lazy-exec-limit.patch is posted to rhkernel-list.

Comment 6 Don Zickus 2006-11-06 03:27:15 UTC
in kernel-2.6.18-1.2744.el5