Bug 212588 - SIGSEGV starting mono on xen kernel
Status: CLOSED DUPLICATE of bug 212625
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: All Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: Steven Rostedt
Depends On: 210830
Blocks:
Reported: 2006-10-27 12:44 EDT by Stephen Tweedie
Modified: 2007-11-30 17:07 EST (History)

Doc Type: Bug Fix
Last Closed: 2006-10-31 09:24:21 EST


Description Stephen Tweedie 2006-10-27 12:44:53 EDT
+++ This bug was initially created as a clone of Bug #210830 +++

Description of problem:
I have a dual-CPU Athlon machine running the Xen kernel 2.6.18-1.2784.fc6xen,
with mono version 1.1.17.1-3 installed.  When I start the mono binary, it
immediately dies with "Segmentation fault".  Looking at the core file
(with gdb /usr/bin/mono core.2608), I see the following:

(gdb) bt
#0  0x080794c0 in mono_arch_cpu_optimizazions ()
#1  0x0805bf5e in parse_optimizations ()
#2  0x0805c444 in mono_main ()
#3  0x0805bee2 in main ()

I recompiled mono from the source rpm, and ran it through gdb.  The call trace
actually ends up looking like this:

main (mono/mini/main.c) -> mono_main (mono/mini/driver.c) -> parse_optimizations
(mono/mini/driver.c) -> mono_arch_cpu_optimizazions (mono/mini/mini-x86.c) ->
cpuid (mono/mini/mini-x86.c, actually inlined).  In the cpuid function, there is
a block of assembly at the top, followed by this code:

	if (have_cpuid) {
		/* Have to use the code manager to get around WinXP DEP */
		MonoCodeManager *codeman = mono_code_manager_new_dynamic ();
		CpuidFunc func;
		void *ptr = mono_code_manager_reserve (codeman, sizeof (cpuid_impl));
		memcpy (ptr, cpuid_impl, sizeof (cpuid_impl));

		func = (CpuidFunc)ptr;
		func (id, p_eax, p_ebx, p_ecx, p_edx);


The call to func at the end there is what actually causes the Segmentation
fault.  This seems to be the assembly copied from the cpuid_impl array, and it
all seems to be copied over properly, so I am not entirely sure what is wrong.

Version-Release number of selected component (if applicable):


How reproducible:
Every time

Steps to Reproduce:
1.  Start mono
  
Actual results:
Mono SEGFAULT

Expected results:
mono starts and runs normally

-- Additional comment from alexl@redhat.com on 2006-10-16 06:49 EST --
I wonder if this is due to the xen TLS optimization thing we did in 1.1.17.1-3.
Have you tried an earlier version and not seen this problem?
Does it go away if you remove the --with-xen_opt=yes from configure in the spec?


-- Additional comment from alexl@redhat.com on 2006-10-16 06:50 EST --
Also, are you running a 32bit or 64bit mono binary?


-- Additional comment from alexl@redhat.com on 2006-10-16 07:04 EST --
And, are you running inside a xen instance or outside?


-- Additional comment from clalance@redhat.com on 2006-10-16 09:15 EST --
Ah, I forgot to mention that.  The with-xen_opt was my first idea, too.  I
actually rebuilt the rpm with "--with-xen_opt=yes" removed, and I still got
the same behavior.  The crash is so early in the startup code, though, that the
option probably doesn't make a difference.

This is the 32-bit mono binary, by the way, running in dom0.

-- Additional comment from clalance@redhat.com on 2006-10-18 20:58 EST --
Wow.  I think I may have figured this one out; I'm just not sure what to do
about it.  I knew mono used to work, so I decided to go back and try various
older fc6 mono versions.  None of them worked properly, which I thought quite
strange.  So then I went back and tried various kernels, since the kernel was
the next most likely culprit.  I eventually found that it worked in 1.2716.fc6,
but not in 1.2723.fc6.  A diff between these two versions shows that the
exec-shield options changed in there, more specifically, making the stack
non-executable.  Aha!  The code in question (in mono/mini/mini-x86.c) allocates
some memory, copies a bunch of opcodes into that memory (from the cpuid_impl
array), and then tries to call that as a function.  So it is trying to execute
(I think) code on the stack, which exec-shield is disallowing.  Sure enough,
setting /proc/sys/kernel/exec-shield to 0 allows mono to start normally.
Incidentally, as far as I can tell, this will affect *all* kernels after 2716,
not just the Xen ones.  Have you happened to test mono on a kernel after 2716?
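
To see which exec-shield setting a machine is running with before applying the
workaround above, the sysctl file can be read back.  The following is only a
minimal sketch: the helper name read_int_file is hypothetical, and the path
/proc/sys/kernel/exec-shield is the one quoted in this comment, which exists
only on kernels that carry the exec-shield patch.

```c
#include <stdio.h>

/* Hypothetical helper: read one decimal integer from a sysctl-style file.
 * Returns 0 and stores the number in *value on success, or -1 if the file
 * cannot be opened or does not start with an integer. */
int read_int_file(const char *path, int *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;              /* e.g. kernel without exec-shield */

    int ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_int_file("/proc/sys/kernel/exec-shield", &v) and getting -1 would
suggest the running kernel has no such knob at all.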

So, that is the workaround.  The real crux of the problem, though, is mono doing
this crack with the executable code on the stack.  One simple solution might be
to take the cpuid_impl, turn it into a normal __asm__ __volatile__ code chunk,
and drop it into cpuid, but I'm not sure how palatable that is upstream, because
I don't know the reason they are doing this to begin with.  Do you have any
further ideas, and/or can you pursue this upstream?

Chris Lalancette

-- Additional comment from clalance@redhat.com on 2006-10-18 23:57 EST --
Created an attachment (id=138852)
Patch to not make cpuid try to run code on the stack

Well, here's a patch that basically does the CPUID call in a PIC friendly
manner, which happened to be the problem before.  This *seems* to do the right
thing, and doesn't execute code on the stack, which is always a good thing. 
One thing to note; I shamelessly stole this bit of assembly from kdemultimedia
(http://www.google.com/codesearch?q=+PIC+version+:+save+ebx+show:8hVzTkQrhro:yQNIAg1iuac:kWmVzQ9LyY8&sa=N&cd=1&ct=rc&cs_p=http://gentoo.osuosl.org/distfiles/kdemultimedia-3.5.3.tar.bz2&cs_f=kdemultimedia-3.5.3/mpeglib/lib/util/mmx/cpu_accel.c#a0).
 
kdemultimedia seems to be GPL, while Mono seems to be LGPL, so we might have to
resolve that/figure something else out, if we decide we want to fix this (and I
think we should).
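
For readers without access to the attachment, the general shape of a
PIC-friendly CPUID call looks roughly like the sketch below.  This is not the
attached patch and not Mono's code: the function name cpuid_inline is
hypothetical, and the sketch assumes GCC-style inline assembly and a CPU that
supports the CPUID instruction.  On i386, %ebx holds the GOT pointer in PIC
code, so it is swapped through %esi around the instruction instead of being
listed as an output.

```c
#include <stdint.h>

/* Hypothetical helper: execute CPUID for leaf `id` using inline assembly.
 * Returns 1 if the instruction was executed, 0 on non-x86 targets. */
int cpuid_inline(uint32_t id, uint32_t *eax, uint32_t *ebx,
                 uint32_t *ecx, uint32_t *edx)
{
#if defined(__i386__)
    /* Preserve %ebx (GOT pointer under -fPIC) by exchanging it with %esi
     * before and after CPUID; the %ebx result comes back in %esi. */
    __asm__ __volatile__(
        "xchgl %%ebx, %%esi\n\t"
        "cpuid\n\t"
        "xchgl %%ebx, %%esi"
        : "=a"(*eax), "=S"(*ebx), "=c"(*ecx), "=d"(*edx)
        : "a"(id), "c"(0));
    return 1;
#elif defined(__x86_64__)
    /* %rbx is not reserved for PIC here, so it can be a plain output. */
    __asm__ __volatile__(
        "cpuid"
        : "=a"(*eax), "=b"(*ebx), "=c"(*ecx), "=d"(*edx)
        : "a"(id), "c"(0));
    return 1;
#else
    (void)id; (void)eax; (void)ebx; (void)ecx; (void)edx;
    return 0;                   /* no CPUID on this architecture */
#endif
}
```

Because nothing is copied into writable memory and then called, a version like
this does not depend on the kernel permitting execution of data pages.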

Chris Lalancette

-- Additional comment from alexl@redhat.com on 2006-10-19 05:03 EST --
It's not running it from the stack. mono_code_manager_reserve() allocates the
memory dynamically. In this case I think it will use malloc(), and we then call
mprotect() on that memory to make it executable.
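
The allocate, copy, mprotect(), call sequence described here can be sketched
as follows.  This is an illustration under stated assumptions, not Mono's code
manager: the name run_copied_code is hypothetical, the six opcode bytes are a
stand-in for cpuid_impl, and the sketch uses mmap() instead of malloc(),
because POSIX only specifies mprotect() for memory obtained from mmap().

```c
#define _DEFAULT_SOURCE         /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*tiny_func)(void);

/* Hypothetical demo: copy a tiny routine into fresh memory, make that memory
 * executable, and call it.  Returns the routine's result (42), or -1 if the
 * mapping or the mprotect() call is refused, or on non-x86 targets. */
int run_copied_code(void)
{
#if defined(__i386__) || defined(__x86_64__)
    /* mov $42, %eax ; ret  (same encoding on i386 and x86_64) */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };
    size_t len = 4096;          /* assumed to cover at least one page */

    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memcpy(mem, code, sizeof code);

    /* Switch from writable to executable; a security policy may refuse. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }

    int result = ((tiny_func)mem)();
    munmap(mem, len);
    return result;
#else
    return -1;
#endif
}
```

Checking the mprotect() return value, as done here, is what would distinguish
a refused permission change from a fault at the call itself.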

Maybe execshield is making the mprotect() call fail? I dunno why I'm not
seeing this problem though. mono runs fine on my core duo with kernel 2.6.18-1.2726.

I don't think your patch is right though. The code manager is probably used in
many places in mono, and your change is just in one place. We should figure out
why it fails and fix it. 

Isn't there a way to flag a binary to allow heap execution with exec-shield?


-- Additional comment from caolanm@redhat.com on 2006-10-19 05:30 EST --
alex: maybe the example patch in openoffice.org of 

openoffice.org-2.0.3.oooXXXXX.selinux.bridges.patch

might be helpful for a similar-sounding scenario where I do the selinux dance
to avoid a simultaneously readable + writable executable memory segment

(http://people.redhat.com/drepper/selinux-mem.html)

-- Additional comment from alexl@redhat.com on 2006-10-19 05:35 EST --
I do have selinux on, and it works for me. Mono has its own selinux context that
allows it to do the stuff it needs. 

Chris: Do you have selinux on?


-- Additional comment from clalance@redhat.com on 2006-10-19 09:06 EST --
No, I don't have SELinux on.  Actually, once I found the workaround (echo 0 >
/proc/sys/kernel/exec-shield), I turned SELinux on just for a test, and ran into
the problem with the heap execution.  setroubleshootd told me I needed to set
the execheap boolean on.  In any case, that is a different problem since in my
original test I had SELinux disabled.

I'm not sure if it is relevant that I am using an Athlon machine?  Maybe some
difference in the way it protects the stack?  That is a pure guess.  I'll try
this out on another machine (Intel) at work today, and let you know if I am more
successful there.

-- Additional comment from clalance@redhat.com on 2006-10-19 10:46 EST --
OK, test cases (all with SELinux off):

Opteron box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
Opteron box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
Opteron box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798, x86_64 mono 1.1.17-3: works
EM64T box, x86_64 kernel 2.6.18-2798xen, x86_64 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798, i686 mono 1.1.17-3: works
EM64T box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Athlon box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT
Pentium 4 box, i686 kernel 2.6.18-2798xen, i686 mono 1.1.17-3: SEGFAULT

OK, so I'm starting to get the pattern: only i686 kernels, only Xen.  The one
exception is the Opteron box running the i686 Xen kernel; that seems to work
fine.  There must be some difference in the way the Xen kernel handles the
exec-shield stuff, which is causing this problem, but I can't explain the
Opteron difference.  Now I am really confused.

-- Additional comment from alexl@redhat.com on 2006-10-23 03:24 EST --
This seems to be more of a xen kernel bug. Reassigning on advice from Dave Jones.
Comment 2 RHEL Product and Program Management 2006-10-30 12:01:55 EST
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.
Comment 3 Jay Turner 2006-10-31 07:27:45 EST
QE ack for RHEL5B2.
Comment 4 Brian Stein 2006-10-31 09:24:21 EST

*** This bug has been marked as a duplicate of 212625 ***
